A social RSS reader built on the AT Protocol. glean.at

Remove full content for article embedding and truncate

+43 -26
+20 -20
docs/specs.md
···
 
 ## 2. Stack
 
-| Layer | Technology |
-| ---------------- | --------------------------------------------------------------- |
-| Backend | Go |
-| Database | SQLite (3 files: users, articles, recs via `mattn/go-sqlite3` + `sqlite-vec` for vector search) |
-| Frontend | htmx + TailwindCSS |
-| Auth | AT Protocol OAuth / DID resolution (configurable PLC directory) |
-| AT Protocol role | AppView for `at.glean.*` lexicons |
-| Data source | AT Protocol Jetstream → SQLite index |
+| Layer | Technology |
+| ---------------- | ----------------------------------------------------------------------------------------------- |
+| Backend | Go |
+| Database | SQLite (3 files: users, articles, recs via `mattn/go-sqlite3` + `sqlite-vec` for vector search) |
+| Frontend | htmx + TailwindCSS |
+| Auth | AT Protocol OAuth / DID resolution (configurable PLC directory) |
+| AT Protocol role | AppView for `at.glean.*` lexicons |
+| Data source | AT Protocol Jetstream → SQLite index |
 
 ## 3. AT Protocol Lexicons
 
···
 
 ### 7.1 Signals
 
-| Signal | Source | Weight (default) | Description |
-| ------------ | ------------------------ | ---------------- | -------------------------------------------------- |
-| Subscription | `subscriptions` | 1.0 | Jaccard over subscriber sets between similar users |
-| Like | `likes` | 0.5 | Time-decayed like co-occurrence (30-day half-life) |
-| Tag | `annotations.tags` | 0.3 | Jaccard over annotation tag sets |
-| Social | `follow_distances` | 0.7 | Follow distance: 1-hop=1.0, 2-hop=0.3, 3-hop=0.1 |
-| Popularity | `feeds.subscriber_count` | 0.2 | `log(1 + subscribers) / log(1 + max)` |
-| Category | `subscriptions.category` | 0.4 | Boost feeds matching user's existing categories |
+| Signal | Source | Weight (default) | Description |
+| ------------ | ------------------------ | ---------------- | ------------------------------------------------------- |
+| Subscription | `subscriptions` | 1.0 | Jaccard over subscriber sets between similar users |
+| Like | `likes` | 0.5 | Time-decayed like co-occurrence (30-day half-life) |
+| Tag | `annotations.tags` | 0.3 | Jaccard over annotation tag sets |
+| Social | `follow_distances` | 0.7 | Follow distance: 1-hop=1.0, 2-hop=0.3, 3-hop=0.1 |
+| Popularity | `feeds.subscriber_count` | 0.2 | `log(1 + subscribers) / log(1 + max)` |
+| Category | `subscriptions.category` | 0.4 | Boost feeds matching user's existing categories |
 | Content | `article_embeddings` | 0.4 | Cosine similarity via embedding KNN (requires embedder) |
 
 ### 7.2 Feed Co-occurrence (Jaccard Similarity)
···
 
 A background goroutine runs on a configurable schedule (`GLEAN_CLUSTER_INTERVAL`, default 10m):
 
-1. **Compute feed embeddings**: Embed new feed descriptions via OpenAI-compatible API into `feed_embeddings` table (skipped if no embedder configured)
+1. **Compute feed embeddings**: Embed new feed descriptions via embedding API into `feed_embeddings` table (skipped if no embedder configured)
 2. **Compute feed similarity**: Batch-update `feed_similarity` table (Jaccard over subscriber sets + embedding cosine similarity)
 3. **Compute user similarity**: Batch-update `user_similarity` table (subscription Jaccard + time-decayed likes + tags + follow boost)
-4. **Compute article embeddings**: Embed new articles' full content (`title + summary + full_content + content`) via OpenAI-compatible API into `article_embeddings` vec0 table (skipped if no embedder configured)
+4. **Compute article embeddings**: Embed new articles (`title + summary + content`, excluding `full_content` to stay within embedding model token limits) via embedding API into `article_embeddings` vec0 table (skipped if no embedder configured)
 5. **Compute follow distances**: Incremental BFS for dirty users (1-hop through 3-hop from `follows` table)
 6. **Compute signal profiles**: Per-user category/tag/like summaries
 7. **Auto-dismiss stale**: Dismiss items shown >=5 times over >5 days without action
···
 
 When `GLEAN_EMBED_BASE_URL` is configured, article text and feed descriptions are embedded into vectors stored in `sqlite-vec` virtual tables (`recs.feed_embeddings`, `recs.article_embeddings`). The vec0 extension provides native KNN vector search via `WHERE embedding MATCH ? AND k = ?`, replacing Go-side cosine similarity for large-scale lookups. Without embeddings, recommendations rely only on subscription overlap, like patterns, and social graph — no content-based signals.
 
-The embedder uses the official `github.com/openai/openai-go` SDK with `option.WithBaseURL()`, so any OpenAI-compatible `/v1/embeddings` endpoint works (OpenAI, Ollama, local inference servers).
+The embedder uses the official `github.com/openai/openai-go` SDK with `option.WithBaseURL()`, so any OpenAI-compatible `/v1/embeddings` endpoint works (OpenAI, Gemini, Ollama, local inference servers).
 
 vec0 tables are created dynamically at startup with the configured dimension (`GLEAN_EMBED_DIMENSION`, default 1536):
 
···
 );
 ```
 
-During cron, `ComputeArticleEmbeddings` embeds new articles in batches (using `title + summary + full_content + content` for maximum semantic coverage) and inserts them into the vec0 table. `ComputeFeedEmbeddings` embeds feed descriptions (`title || description`) and re-embeds when the source text changes (detected via `feed_embedding_meta`). During on-demand article recommendations, the user's liked article embeddings are averaged into an interest vector, then a vec0 KNN query finds the top-200 most semantically similar articles. For cold-start users (<5 subscriptions), their subscribed feed embeddings are averaged and a KNN query finds similar feeds.
+During cron, `ComputeArticleEmbeddings` embeds new articles in batches using `title + summary + content` (the scraped `full_content` is excluded to stay within model token limits — most embedding models cap at ~8k tokens). Text is truncated to 8000 characters as a safety net. Batches are capped at 100 inputs per API call. `ComputeFeedEmbeddings` embeds feed descriptions (`title || description`) and re-embeds when the source text changes (detected via `feed_embedding_meta`). During on-demand article recommendations, the user's liked article embeddings are averaged into an interest vector, then a vec0 KNN query finds the top-200 most semantically similar articles. For cold-start users (<5 subscriptions), their subscribed feed embeddings are averaged and a KNN query finds similar feeds.
 
 ## 8. HTTP API / htmx Endpoints
 
+23 -6
internal/cluster/article.go
···
 	vec "github.com/asg017/sqlite-vec-go-bindings/cgo"
 )
 
+// embedBatchSize caps how many texts are sent in a single embedding API call.
+// Most /v1/embeddings endpoints accept up to ~2048 inputs per request; lower
+// values reduce payload size and memory pressure.
 const embedBatchSize = 100
 
-// ComputeArticleEmbeddings embeds new articles (title + summary) into the
-// article_embeddings vec0 table. Skipped when no embedder is configured.
-// Existing embeddings for deleted articles are cleaned up. Articles already
-// embedded are not re-embedded.
+// maxEmbedChars truncates text sent to the embedding model. Most models have
+// a token context window (~4 chars/token); title+summary+content stays well
+// under this, but the cap is kept as a safety net.
+const maxEmbedChars = 8000
+
+func truncateForEmbed(s string) string {
+	if len(s) <= maxEmbedChars {
+		return s
+	}
+	return s[:maxEmbedChars]
+}
+
+// ComputeArticleEmbeddings embeds new articles (title + summary + content) into
+// the article_embeddings vec0 table. full_content (scraped body) is excluded
+// because embedding models have token context limits and the feed-provided
+// content already captures the topical signal needed for recommendation KNN.
+
 func (e *Engine) ComputeArticleEmbeddings(ctx context.Context) error {
 	if e.embedder == nil {
 		e.logger.Debug("article embeddings skipped, no embedder")
···
 	}
 
 	rows, err := conn.QueryContext(ctx, `
-		SELECT a.id, COALESCE(a.title, '') || ' ' || COALESCE(a.summary, '') || ' ' || COALESCE(a.full_content, '') || ' ' || COALESCE(a.content, '')
+		SELECT a.id, COALESCE(a.title, '') || ' ' || COALESCE(a.summary, '') || ' ' || COALESCE(a.content, '')
 		FROM articles.articles a
-		WHERE (COALESCE(a.title, '') != '' OR COALESCE(a.summary, '') != '' OR COALESCE(a.full_content, '') != '' OR COALESCE(a.content, '') != '')
+		WHERE (COALESCE(a.title, '') != '' OR COALESCE(a.summary, '') != '' OR COALESCE(a.content, '') != '')
 		AND a.id NOT IN (SELECT article_id FROM recs.article_embeddings)
 		ORDER BY a.id
 	`)
···
 			rows.Close()
 			return err
 		}
+		a.text = truncateForEmbed(a.text)
 		batch = append(batch, a)
 	}
 	rows.Close()