A social RSS reader built on the AT Protocol. glean.at
glean atproto atmosphere rss feed social app

Overhaul recommendation system and improve performance

+1888 -516
+7
.env.example
···
 # Leave empty for localhost OAuth (development)
 # GLEAN_OAUTH_CLIENT_ID=https://glean.at/oauth/client-metadata
 # GLEAN_OAUTH_REDIRECT_URL=https://glean.at/auth/callback
+# Embeddings (recommended — powers content-based feed/article recommendations)
+# Point to any OpenAI-compatible /v1/embeddings endpoint (OpenAI, Ollama, etc.)
+# Without embeddings, recommendations rely only on subscription overlap and social graph.
+GLEAN_EMBED_BASE_URL=https://api.openai.com/v1
+GLEAN_EMBED_API_KEY=sk-...
+GLEAN_EMBED_MODEL=text-embedding-3-small
+GLEAN_EMBED_DIMENSION=1536
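How the application reads these variables is not part of this diff. The following is a minimal sketch, assuming a hypothetical `loadEmbedConfig` helper and `embedConfig` struct, that treats an empty `GLEAN_EMBED_BASE_URL` as "embeddings disabled" and falls back to the documented default dimension of 1536.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// embedConfig is a hypothetical holder for the GLEAN_EMBED_* settings.
type embedConfig struct {
	BaseURL   string
	APIKey    string
	Model     string
	Dimension int
}

// loadEmbedConfig reads the settings through getenv (os.Getenv in production).
// The second result is false when no base URL is set, i.e. embeddings are off.
func loadEmbedConfig(getenv func(string) string) (embedConfig, bool, error) {
	cfg := embedConfig{
		BaseURL:   getenv("GLEAN_EMBED_BASE_URL"),
		APIKey:    getenv("GLEAN_EMBED_API_KEY"),
		Model:     getenv("GLEAN_EMBED_MODEL"),
		Dimension: 1536, // default stated in docs/specs.md
	}
	if cfg.BaseURL == "" {
		return cfg, false, nil
	}
	if raw := getenv("GLEAN_EMBED_DIMENSION"); raw != "" {
		dim, err := strconv.Atoi(raw)
		if err != nil || dim <= 0 {
			return cfg, false, fmt.Errorf("invalid GLEAN_EMBED_DIMENSION %q", raw)
		}
		cfg.Dimension = dim
	}
	return cfg, true, nil
}

func main() {
	cfg, enabled, err := loadEmbedConfig(os.Getenv)
	fmt.Println(cfg.Model, cfg.Dimension, enabled, err)
}
```

Passing `getenv` as a parameter keeps the parser testable without touching the process environment.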
+1
Dockerfile
···
 RUN npx tailwindcss -i ./static/input.css -o ./static/output.css --minify

 RUN --mount=type=cache,target=/root/.cache/go-build \
+    CGO_CFLAGS="-I/src/internal/db/include -I$(go env GOMODCACHE)/github.com/mattn/go-sqlite3@$(grep 'mattn/go-sqlite3' go.mod | awk '{print $2}')" \
     CGO_ENABLED=1 go build -tags fts5 -ldflags="-s -w" -o /glean .

 FROM alpine:3.21
+4
Makefile
···
+SQLITE3_VER := $(shell grep 'mattn/go-sqlite3' go.mod | awk '{print $$2}')
+SQLITE3_INC := $(shell go env GOMODCACHE)/github.com/mattn/go-sqlite3@$(SQLITE3_VER)
+export CGO_CFLAGS := -I$(CURDIR)/internal/db/include -I$(SQLITE3_INC)
+
 .PHONY: tools-install
 tools-install:
 	go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest
+95 -23
docs/specs.md
··· 13 13 | Layer | Technology | 14 14 | ---------------- | --------------------------------------------------------------- | 15 15 | Backend | Go | 16 - | Database | SQLite (3 files: users, articles, recs via `mattn/go-sqlite3`) | 16 + | Database | SQLite (3 files: users, articles, recs via `mattn/go-sqlite3` + `sqlite-vec` for vector search) | 17 17 | Frontend | htmx + TailwindCSS | 18 18 | Auth | AT Protocol OAuth / DID resolution (configurable PLC directory) | 19 19 | AT Protocol role | AppView for `at.glean.*` lexicons | ··· 460 460 461 461 ```sql 462 462 CREATE TABLE users ( 463 - did TEXT PRIMARY KEY, 464 - indexed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 465 - updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP 463 + did TEXT PRIMARY KEY, 464 + indexed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 465 + updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 466 + follows_dirty BOOLEAN NOT NULL DEFAULT 1 466 467 ); 467 468 ``` 468 469 ··· 649 650 | Subscription | `subscriptions` | 1.0 | Jaccard over subscriber sets between similar users | 650 651 | Like | `likes` | 0.5 | Time-decayed like co-occurrence (30-day half-life) | 651 652 | Tag | `annotations.tags` | 0.3 | Jaccard over annotation tag sets | 652 - | Social | `follow_distances` | 0.7 | Follow distance: 1-hop=1.0, 2-hop=0.3 | 653 + | Social | `follow_distances` | 0.7 | Follow distance: 1-hop=1.0, 2-hop=0.3, 3-hop=0.1 | 653 654 | Popularity | `feeds.subscriber_count` | 0.2 | `log(1 + subscribers) / log(1 + max)` | 654 655 | Category | `subscriptions.category` | 0.4 | Boost feeds matching user's existing categories | 656 + | Content | `article_embeddings` | 0.4 | Cosine similarity via embedding KNN (requires embedder) | 655 657 656 658 ### 7.2 Feed Co-occurrence (Jaccard Similarity) 657 659 ··· 661 663 J(A, B) = |subscribers(A) ∩ subscribers(B)| / |subscribers(A) ∪ subscribers(B)| 662 664 ``` 663 665 664 - Feed description text similarity is also computed (word overlap after stopword 
removal) and added as a boost. 666 + Feed description similarity is also computed via embedding cosine similarity (requires embedder) and added as a boost. 665 667 666 668 ### 7.3 User Similarity 667 669 ··· 700 702 ``` 701 703 score = like_signal * w_like 702 704 + social_signal * w_social 705 + + content_signal * w_content 703 706 + recency_signal * 0.2 704 707 ``` 705 708 709 + Content signal uses embedding vectors: the user's liked article embeddings are averaged into a single interest vector, then a KNN query against the `article_embeddings` vec0 table finds semantically similar articles. This requires an embedder to be configured; without it, the content signal is 0. 710 + 706 711 ### 7.5 User Feedback (Dismiss) 707 712 708 713 Users can dismiss recommendations they don't want to see again: ··· 711 716 - `POST /articles/dismiss` — dismiss an article recommendation 712 717 - Dismissals are stored locally in `dismissed_recommendations` (not on PDS) 713 718 - Dismissed items are excluded from all future recommendation queries 714 - - Auto-dismiss: items shown >15 times over >30 days without action are auto-dismissed 719 + - Auto-dismiss: items shown ≥5 times over >5 days without action are auto-dismissed 715 720 716 721 Impression tracking (`recommendation_impressions`) records how many times each recommendation was shown and whether the user acted on it. 717 722 ··· 729 734 730 735 ### 7.7 Social Graph 731 736 732 - Follow distances (1-hop and 2-hop) are pre-computed in `follow_distances` during the cron job: 737 + Follow distances (1-hop through 3-hop) are computed incrementally. A `follows_dirty` column on `users` tracks whose follow graph changed since the last cron run. Only dirty users are reprocessed — their existing rows in `follow_distances` are deleted and recomputed via BFS, then the dirty flag is cleared. 
733 738 734 739 - 1-hop: direct follows (weight 1.0) 735 740 - 2-hop: friends-of-friends (weight 0.3) 736 - - 3-hop is excluded due to noise and computational cost 741 + - 3-hop: third-degree connections (weight 0.1) 737 742 738 743 ### 7.8 Diversity & Freshness 739 744 ··· 754 759 755 760 A background goroutine runs on a configurable schedule (`GLEAN_CLUSTER_INTERVAL`, default 10m): 756 761 757 - 1. **Compute feed similarity**: Batch-update `feed_similarity` table (Jaccard over subscriber sets + description similarity) 758 - 2. **Compute user similarity**: Batch-update `user_similarity` table (subscription Jaccard + time-decayed likes + tags + follow boost) 759 - 3. **Compute follow distances**: 1-hop and 2-hop from `follows` table 760 - 4. **Compute signal profiles**: Per-user category/tag/like summaries 761 - 5. **Auto-dismiss stale**: Dismiss items shown >15 times over >30 days without action 762 + 1. **Compute feed embeddings**: Embed new feed descriptions via OpenAI-compatible API into `feed_embeddings` table (skipped if no embedder configured) 763 + 2. **Compute feed similarity**: Batch-update `feed_similarity` table (Jaccard over subscriber sets + embedding cosine similarity) 764 + 3. **Compute user similarity**: Batch-update `user_similarity` table (subscription Jaccard + time-decayed likes + tags + follow boost) 765 + 4. **Compute article embeddings**: Embed new articles' full content (`title + summary + full_content + content`) via OpenAI-compatible API into `article_embeddings` vec0 table (skipped if no embedder configured) 766 + 5. **Compute follow distances**: Incremental BFS for dirty users (1-hop through 3-hop from `follows` table) 767 + 6. **Compute signal profiles**: Per-user category/tag/like summaries 768 + 7. **Auto-dismiss stale**: Dismiss items shown >=5 times over >5 days without action 762 769 763 770 Jetstream ingestion and record indexing happen in a separate persistent goroutine (the Jetstream consumer), not in the cron. 
764 771 765 - ### 7.11 Recommendation Tables (`<base>_recs`) 772 + ### 7.11 User Interaction Tables (`<base>_users`) 773 + 774 + Per-user interaction state lives in the users database so that real-time writes (impressions, dismissals) never contend with cron batch writes to the recs database. 766 775 767 776 ```sql 768 777 CREATE TABLE dismissed_recommendations ( ··· 784 793 acted BOOLEAN NOT NULL DEFAULT 0, 785 794 PRIMARY KEY (user_did, target_type, target_id) 786 795 ); 796 + ``` 797 + 798 + ### 7.12 Computed Recommendation Tables (`<base>_recs`) 799 + 800 + Written exclusively by the cron. No user-facing writes — only reads during on-demand scoring. 801 + 802 + ```sql 803 + CREATE TABLE feed_similarity ( 804 + feed_a TEXT NOT NULL, 805 + feed_b TEXT NOT NULL, 806 + jaccard REAL NOT NULL, 807 + computed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 808 + PRIMARY KEY (feed_a, feed_b), 809 + CHECK(feed_a < feed_b) 810 + ); 811 + 812 + CREATE TABLE user_similarity ( 813 + user_a TEXT NOT NULL, 814 + user_b TEXT NOT NULL, 815 + jaccard REAL NOT NULL, 816 + common_feeds INTEGER NOT NULL, 817 + common_likes INTEGER NOT NULL DEFAULT 0, 818 + common_tags INTEGER NOT NULL DEFAULT 0, 819 + computed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, 820 + PRIMARY KEY (user_a, user_b), 821 + CHECK(user_a < user_b) 822 + ); 787 823 788 824 CREATE TABLE follow_distances ( 789 825 user_a TEXT NOT NULL, 790 826 user_b TEXT NOT NULL, 791 - distance INTEGER NOT NULL CHECK(distance IN (1, 2)), 827 + distance INTEGER NOT NULL CHECK(distance IN (1, 2, 3)), 792 828 PRIMARY KEY (user_a, user_b) 793 829 ); 794 830 ··· 800 836 w_social REAL NOT NULL DEFAULT 0.7, 801 837 w_pop REAL NOT NULL DEFAULT 0.2, 802 838 w_category REAL NOT NULL DEFAULT 0.4, 839 + w_content REAL NOT NULL DEFAULT 0.4, 803 840 updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP 804 841 ); 805 842 ··· 808 845 total_likes INTEGER NOT NULL DEFAULT 0, 809 846 total_tags INTEGER NOT NULL DEFAULT 0, 810 847 
top_categories TEXT, 848 + top_tags TEXT, 811 849 updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP 812 850 ); 813 851 ``` 814 852 853 + ### 7.13 Embeddings (recommended) 854 + 855 + When `GLEAN_EMBED_BASE_URL` is configured, article text and feed descriptions are embedded into vectors stored in `sqlite-vec` virtual tables (`recs.feed_embeddings`, `recs.article_embeddings`). The vec0 extension provides native KNN vector search via `WHERE embedding MATCH ? AND k = ?`, replacing Go-side cosine similarity for large-scale lookups. Without embeddings, recommendations rely only on subscription overlap, like patterns, and social graph — no content-based signals. 856 + 857 + The embedder uses the official `github.com/openai/openai-go` SDK with `option.WithBaseURL()`, so any OpenAI-compatible `/v1/embeddings` endpoint works (OpenAI, Ollama, local inference servers). 858 + 859 + vec0 tables are created dynamically at startup with the configured dimension (`GLEAN_EMBED_DIMENSION`, default 1536): 860 + 861 + ```sql 862 + CREATE VIRTUAL TABLE recs.feed_embeddings USING vec0( 863 + feed_url TEXT PRIMARY KEY, 864 + embedding float[1536] 865 + ); 866 + 867 + CREATE VIRTUAL TABLE recs.article_embeddings USING vec0( 868 + article_id INTEGER PRIMARY KEY, 869 + embedding float[1536] 870 + ); 871 + ``` 872 + 873 + Since vec0 virtual tables cannot hold metadata columns, a side table tracks the source text for re-embedding on description changes: 874 + 875 + ```sql 876 + CREATE TABLE recs.feed_embedding_meta ( 877 + feed_url TEXT PRIMARY KEY, 878 + source_text TEXT NOT NULL DEFAULT '' 879 + ); 880 + ``` 881 + 882 + During cron, `ComputeArticleEmbeddings` embeds new articles in batches (using `title + summary + full_content + content` for maximum semantic coverage) and inserts them into the vec0 table. `ComputeFeedEmbeddings` embeds feed descriptions (`title || description`) and re-embeds when the source text changes (detected via `feed_embedding_meta`). 
During on-demand article recommendations, the user's liked article embeddings are averaged into an interest vector, then a vec0 KNN query finds the top-200 most semantically similar articles. For cold-start users (<5 subscriptions), their subscribed feed embeddings are averaged and a KNN query finds similar feeds. 883 + 815 884 ## 8. HTTP API / htmx Endpoints 816 885 817 886 The server renders HTML fragments that htmx swaps into the page. No JSON API needed for the frontend. ··· 903 972 │ │ └── metrics.go # Prometheus metrics definitions 904 973 │ ├── cluster/ 905 974 │ │ ├── jaccard.go # Jaccard similarity computation 975 + │ │ ├── embed.go # Embedder interface + OpenAI-compatible implementation 976 + │ │ ├── article.go # Article + feed embedding computation, vec0 KNN content boost 906 977 │ │ ├── scoring.go # Feed + people + article recommendation queries (on-demand) 907 - │ │ ├── social.go # Follow-distance computation (1-2 hop) 978 + │ │ ├── social.go # Incremental follow-distance computation (1-3 hop, dirty-flag) 908 979 │ │ ├── dismiss.go # Dismiss + impression tracking 909 980 │ │ ├── weights.go # Bandit-style signal weight auto-tuning 910 981 │ │ ├── diversity.go # Post-query domain/category diversity filtering ··· 994 1065 995 1066 ``` 996 1067 Cron (every 10m) ──► Cluster Engine 997 - 998 - ├─► Compute feed similarity 999 - ├─► Compute user similarity 1000 - ├─► Compute follow distances 1001 - ├─► Compute signal profiles 1002 - └─► Auto-dismiss stale recommendations 1068 + 1069 + ├─► Compute feed similarity 1070 + ├─► Compute user similarity 1071 + ├─► Compute article embeddings (if embedder configured) 1072 + ├─► Compute follow distances 1073 + ├─► Compute signal profiles 1074 + └─► Auto-dismiss stale recommendations 1003 1075 1004 1076 Browser ──GET /dashboard──► Server 1005 1077
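The follow-distance step described in section 7.7 of the spec can be sketched as a breadth-first search capped at three hops. The real `social.go` is not shown in this diff, so the function name `followDistances`, the in-memory adjacency map, and the edge direction (follower to followed) are assumptions made for illustration; the hop weights are the ones listed in the spec.

```go
package main

import "fmt"

// hopWeights mirrors the weights listed in the spec: 1-hop=1.0, 2-hop=0.3, 3-hop=0.1.
var hopWeights = map[int]float64{1: 1.0, 2: 0.3, 3: 0.1}

// followDistances runs a breadth-first search from start over follows
// (follower -> followed DIDs) and returns the hop count of every user
// reachable within maxHops. The start user itself is not included.
func followDistances(follows map[string][]string, start string, maxHops int) map[string]int {
	dist := map[string]int{}
	visited := map[string]bool{start: true}
	frontier := []string{start}
	for hop := 1; hop <= maxHops && len(frontier) > 0; hop++ {
		var next []string
		for _, u := range frontier {
			for _, v := range follows[u] {
				if visited[v] {
					continue // already reached by a shorter or equal path
				}
				visited[v] = true
				dist[v] = hop
				next = append(next, v)
			}
		}
		frontier = next
	}
	return dist
}

func main() {
	follows := map[string][]string{
		"did:a": {"did:b"},
		"did:b": {"did:c"},
		"did:c": {"did:d"},
		"did:d": {"did:e"},
	}
	for did, hops := range followDistances(follows, "did:a", 3) {
		fmt.Println(did, hops, hopWeights[hops])
	}
}
```

In the incremental scheme the spec describes, a function like this would run only for users whose `follows_dirty` flag is set, after their existing `follow_distances` rows are deleted.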
+8 -2
go.mod
···
 go 1.26.2

 require (
+	github.com/asg017/sqlite-vec-go-bindings v0.1.6
 	github.com/bluesky-social/indigo v0.0.0-20260417172304-7da09df6081d
 	github.com/bluesky-social/jetstream v0.0.0-20260415170838-8a65de4eda28
 	github.com/go-chi/chi/v5 v5.2.5
 	github.com/go-chi/cors v1.2.2
 	github.com/mattn/go-sqlite3 v1.14.22
+	github.com/openai/openai-go v1.12.0
 	github.com/prometheus/client_golang v1.19.1
+	github.com/prometheus/client_model v0.6.1
+	github.com/prometheus/common v0.54.0
 	go.uber.org/atomic v1.11.0
 	golang.org/x/net v0.53.0
 	golang.org/x/sync v0.20.0
···
 	github.com/multiformats/go-multibase v0.2.0 // indirect
 	github.com/multiformats/go-multihash v0.2.3 // indirect
 	github.com/multiformats/go-varint v0.0.7 // indirect
-	github.com/prometheus/client_model v0.6.1 // indirect
-	github.com/prometheus/common v0.54.0 // indirect
 	github.com/prometheus/procfs v0.15.1 // indirect
 	github.com/spaolacci/murmur3 v1.1.0 // indirect
+	github.com/tidwall/gjson v1.14.4 // indirect
+	github.com/tidwall/match v1.1.1 // indirect
+	github.com/tidwall/pretty v1.2.1 // indirect
+	github.com/tidwall/sjson v1.2.5 // indirect
 	github.com/whyrusleeping/cbor-gen v0.2.1-0.20241030202151-b7a6831be65e // indirect
 	gitlab.com/yawning/secp256k1-voi v0.0.0-20230925100816-f2616030848b // indirect
 	gitlab.com/yawning/tuplehash v0.0.0-20230713102510-df83abbf9a02 // indirect
+14
go.sum
··· 1 + github.com/asg017/sqlite-vec-go-bindings v0.1.6 h1:Nx0jAzyS38XpkKznJ9xQjFXz2X9tI7KqjwVxV8RNoww= 2 + github.com/asg017/sqlite-vec-go-bindings v0.1.6/go.mod h1:A8+cTt/nKFsYCQF6OgzSNpKZrzNo5gQsXBTfsXHXY0Q= 1 3 github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM= 2 4 github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw= 3 5 github.com/bluesky-social/indigo v0.0.0-20260417172304-7da09df6081d h1:ThKFUrkm2/IZwbvmIKLJYr0wPHibtCkIVmuZCWmdIHM= ··· 49 51 github.com/multiformats/go-multihash v0.2.3/go.mod h1:dXgKXCXjBzdscBLk9JkjINiEsCKRVch90MdaGiKsvSM= 50 52 github.com/multiformats/go-varint v0.0.7 h1:sWSGR+f/eu5ABZA2ZpYKBILXTTs9JWpdEM/nEGOHFS8= 51 53 github.com/multiformats/go-varint v0.0.7/go.mod h1:r8PUYw/fD/SjBCiKOoDlGF6QawOELpZAu9eioSos/OU= 54 + github.com/openai/openai-go v1.12.0 h1:NBQCnXzqOTv5wsgNC36PrFEiskGfO5wccfCWDo9S1U0= 55 + github.com/openai/openai-go v1.12.0/go.mod h1:g461MYGXEXBVdV5SaR/5tNzNbSfwTBBefwc+LlDCK0Y= 52 56 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= 53 57 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= 54 58 github.com/prometheus/client_golang v1.19.1 h1:wZWJDwK+NameRJuPGDhlnFgx8e8HN3XHQeLaYJFJBOE= ··· 63 67 github.com/spaolacci/murmur3 v1.1.0/go.mod h1:JwIasOWyU6f++ZhiEuf87xNszmSA2myDM2Kzu9HwQUA= 64 68 github.com/stretchr/testify v1.10.0 h1:Xv5erBjTwe/5IxqUQTdXv5kgmIvbHo3QQyRwhJsOfJA= 65 69 github.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY= 70 + github.com/tidwall/gjson v1.14.2/go.mod h1:/wbyibRr2FHMks5tjHJ5F8dMZh3AcwJEMf5vlfC0lxk= 71 + github.com/tidwall/gjson v1.14.4 h1:uo0p8EbA09J7RQaflQ1aBRffTR7xedD2bcIVSYxLnkM= 72 + github.com/tidwall/gjson v1.14.4/go.mod h1:/wbyibRr2FHMks5tjHJ5F8dMZh3AcwJEMf5vlfC0lxk= 73 + github.com/tidwall/match v1.1.1 h1:+Ho715JplO36QYgwN9PGYNhgZvoUSc9X2c80KVTi+GA= 74 + github.com/tidwall/match v1.1.1/go.mod 
h1:eRSPERbgtNPcGhD8UCthc6PmLEQXEWd3PRB5JTxsfmM= 75 + github.com/tidwall/pretty v1.2.0/go.mod h1:ITEVvHYasfjBbM0u2Pg8T2nJnzm8xPwvNhhsoaGGjNU= 76 + github.com/tidwall/pretty v1.2.1 h1:qjsOFOWWQl+N3RsoF5/ssm1pHmJJwhjlSbZ51I6wMl4= 77 + github.com/tidwall/pretty v1.2.1/go.mod h1:ITEVvHYasfjBbM0u2Pg8T2nJnzm8xPwvNhhsoaGGjNU= 78 + github.com/tidwall/sjson v1.2.5 h1:kLy8mja+1c9jlljvWTlSazM7cKDRfJuR/bOJhcY5NcY= 79 + github.com/tidwall/sjson v1.2.5/go.mod h1:Fvgq9kS/6ociJEDnK0Fk1cpYF4FIW6ZF7LAe+6jwd28= 66 80 github.com/whyrusleeping/cbor-gen v0.2.1-0.20241030202151-b7a6831be65e h1:28X54ciEwwUxyHn9yrZfl5ojgF4CBNLWX7LR0rvBkf4= 67 81 github.com/whyrusleeping/cbor-gen v0.2.1-0.20241030202151-b7a6831be65e/go.mod h1:pM99HXyEbSQHcosHc0iW7YFmwnscr+t9Te4ibko05so= 68 82 gitlab.com/yawning/secp256k1-voi v0.0.0-20230925100816-f2616030848b h1:CzigHMRySiX3drau9C6Q5CAbNIApmLdat5jPMqChvDA=
+378
internal/cluster/article.go
···
+package cluster
+
+import (
+	"context"
+	"database/sql"
+	"fmt"
+	"log/slog"
+	"strings"
+
+	vec "github.com/asg017/sqlite-vec-go-bindings/cgo"
+)
+
+const embedBatchSize = 100
+
+// ComputeArticleEmbeddings embeds new articles (title + summary + full_content
+// + content) into the article_embeddings vec0 table. Skipped when no embedder
+// is configured. Existing embeddings for deleted articles are cleaned up.
+// Articles already embedded are not re-embedded.
+func (e *Engine) ComputeArticleEmbeddings(ctx context.Context) error {
+	if e.embedder == nil {
+		e.logger.Debug("article embeddings skipped, no embedder")
+		return nil
+	}
+
+	conn, err := e.db.Conn(ctx)
+	if err != nil {
+		return err
+	}
+	defer conn.Close()
+
+	_, err = conn.ExecContext(ctx, `
+		DELETE FROM recs.article_embeddings
+		WHERE article_id NOT IN (SELECT id FROM articles.articles)
+	`)
+	if err != nil {
+		return fmt.Errorf("clean stale article embeddings: %w", err)
+	}
+
+	rows, err := conn.QueryContext(ctx, `
+		SELECT a.id, COALESCE(a.title, '') || ' ' || COALESCE(a.summary, '') || ' ' || COALESCE(a.full_content, '') || ' ' || COALESCE(a.content, '')
+		FROM articles.articles a
+		WHERE (COALESCE(a.title, '') != '' OR COALESCE(a.summary, '') != '' OR COALESCE(a.full_content, '') != '' OR COALESCE(a.content, '') != '')
+		  AND a.id NOT IN (SELECT article_id FROM recs.article_embeddings)
+		ORDER BY a.id
+	`)
+	if err != nil {
+		return err
+	}
+
+	type article struct {
+		id   int64
+		text string
+	}
+	var batch []article
+	for rows.Next() {
+		var a article
+		if err := rows.Scan(&a.id, &a.text); err != nil {
+			rows.Close()
+			return err
+		}
+		batch = append(batch, a)
+	}
+	rows.Close()
+
+	if len(batch) == 0 {
+		e.logger.Info("article embeddings up to date")
+		return nil
+	}
+
+	for i := 0; i < len(batch); i += embedBatchSize {
+		end := min(i+embedBatchSize, len(batch))
+		sub := batch[i:end]
+
+		texts := make([]string, len(sub))
+		for j, a := range sub {
+			texts[j] = a.text
+		}
+
+		embeddings, err := e.embedder.Embed(ctx, texts)
+		if err != nil {
+			return fmt.Errorf("embed batch %d: %w", i/embedBatchSize, err)
+		}
+
+		tx, err := conn.BeginTx(ctx, nil)
+		if err != nil {
+			return err
+		}
+		defer func() { _ = tx.Rollback() }()
+
+		for j, emb := range embeddings {
+			blob, err := vec.SerializeFloat32(emb)
+			if err != nil {
+				return fmt.Errorf("serialize embedding: %w", err)
+			}
+			if _, err := tx.ExecContext(ctx,
+				`INSERT OR IGNORE INTO recs.article_embeddings(article_id, embedding) VALUES (?, ?)`,
+				sub[j].id, blob,
+			); err != nil {
+				return fmt.Errorf("insert embedding: %w", err)
+			}
+		}
+
+		if err := tx.Commit(); err != nil {
+			return err
+		}
+
+		e.logger.Info("article embeddings batch computed",
+			slog.Int("batch", i/embedBatchSize),
+			slog.Int("count", len(sub)),
+		)
+	}
+
+	e.logger.Info("article embeddings computed", slog.Int("total", len(batch)))
+	return nil
+}
+
+func (e *Engine) populateContentBoost(ctx context.Context, conn *sql.Conn, userDID string) error {
+	rows, err := conn.QueryContext(ctx, `
+		SELECT a.id FROM articles.likes ul
+		JOIN articles.articles a ON a.feed_url = ul.feed_url AND a.url = ul.article_url
+		WHERE ul.author_did = ?
+	`, userDID)
+	if err != nil {
+		return err
+	}
+	var articleIDs []int64
+	for rows.Next() {
+		var id int64
+		if err := rows.Scan(&id); err != nil {
+			rows.Close()
+			return err
+		}
+		articleIDs = append(articleIDs, id)
+	}
+	rows.Close()
+
+	if len(articleIDs) == 0 {
+		return nil
+	}
+
+	ph := make([]string, len(articleIDs))
+	args := make([]any, len(articleIDs))
+	for i, id := range articleIDs {
+		ph[i] = "?"
+		args[i] = id
+	}
+	embRows, err := conn.QueryContext(ctx,
+		fmt.Sprintf("SELECT article_id, embedding FROM recs.article_embeddings WHERE article_id IN (%s)", joinPh(ph)),
+		args...,
+	)
+	if err != nil {
+		return err
+	}
+
+	dim := e.embedder.Dimension()
+	sumVec := make([]float32, dim)
+	count := 0
+	likedSet := make(map[int64]bool)
+	for embRows.Next() {
+		var id int64
+		var blob []byte
+		if err := embRows.Scan(&id, &blob); err != nil {
+			embRows.Close()
+			return err
+		}
+		v := deserializeFloat32(blob)
+		if len(v) != dim {
+			continue
+		}
+		for j := range sumVec {
+			sumVec[j] += v[j]
+		}
+		count++
+		likedSet[id] = true
+	}
+	embRows.Close()
+
+	if count == 0 {
+		return nil
+	}
+
+	avgVec := make([]float32, dim)
+	for j := range avgVec {
+		avgVec[j] = sumVec[j] / float32(count)
+	}
+
+	queryBlob, err := vec.SerializeFloat32(avgVec)
+	if err != nil {
+		return fmt.Errorf("serialize query vector: %w", err)
+	}
+
+	const topK = 200
+	knnRows, err := conn.QueryContext(ctx, `
+		SELECT article_id, distance FROM recs.article_embeddings
+		WHERE embedding MATCH ? AND k = ?
+		ORDER BY distance
+	`, queryBlob, topK+len(likedSet))
+	if err != nil {
+		return err
+	}
+
+	tx, err := conn.BeginTx(ctx, nil)
+	if err != nil {
+		return err
+	}
+	defer func() { _ = tx.Rollback() }()
+
+	for knnRows.Next() {
+		var id int64
+		var dist float64
+		if err := knnRows.Scan(&id, &dist); err != nil {
+			knnRows.Close()
+			return err
+		}
+		if likedSet[id] {
+			continue
+		}
+		score := 1.0 - dist
+		if score <= 0 {
+			continue
+		}
+		if _, err := tx.ExecContext(ctx,
+			`INSERT OR IGNORE INTO _content_boost (article_id, score) VALUES (?, ?)`,
+			id, score,
+		); err != nil {
+			return err
+		}
+	}
+	knnRows.Close()
+
+	return tx.Commit()
+}
+
+// ComputeFeedEmbeddings embeds feed descriptions (title + description) into the
+// feed_embeddings vec0 table and tracks source text in feed_embedding_meta for
+// re-embedding on description change. Skipped when no embedder is configured.
+func (e *Engine) ComputeFeedEmbeddings(ctx context.Context) error {
+	if e.embedder == nil {
+		e.logger.Debug("feed embeddings skipped, no embedder")
+		return nil
+	}
+
+	conn, err := e.db.Conn(ctx)
+	if err != nil {
+		return err
+	}
+	defer conn.Close()
+
+	_, err = conn.ExecContext(ctx, `
+		DELETE FROM recs.feed_embeddings
+		WHERE feed_url NOT IN (SELECT feed_url FROM articles.feeds)
+	`)
+	if err != nil {
+		return fmt.Errorf("clean stale feed embeddings: %w", err)
+	}
+	_, err = conn.ExecContext(ctx, `
+		DELETE FROM recs.feed_embedding_meta
+		WHERE feed_url NOT IN (SELECT feed_url FROM articles.feeds)
+	`)
+	if err != nil {
+		return fmt.Errorf("clean stale feed embedding meta: %w", err)
+	}
+
+	rows, err := conn.QueryContext(ctx, `
+		SELECT f.feed_url, COALESCE(f.title, '') || ' ' || COALESCE(f.description, '')
+		FROM articles.feeds f
+		WHERE (COALESCE(f.title, '') != '' OR COALESCE(f.description, '') != '')
+		  AND (
+		    f.feed_url NOT IN (SELECT feed_url FROM recs.feed_embedding_meta)
+		    OR EXISTS (
+		      SELECT 1 FROM recs.feed_embedding_meta fm
+		      WHERE fm.feed_url = f.feed_url
+		        AND fm.source_text != COALESCE(f.title, '') || ' ' || COALESCE(f.description, '')
+		    )
+		  )
+		ORDER BY f.feed_url
+	`)
+	if err != nil {
+		return err
+	}
+
+	type feed struct {
+		url  string
+		text string
+	}
+	var batch []feed
+	for rows.Next() {
+		var f feed
+		if err := rows.Scan(&f.url, &f.text); err != nil {
+			rows.Close()
+			return err
+		}
+		batch = append(batch, f)
+	}
+	rows.Close()
+
+	if len(batch) == 0 {
+		e.logger.Info("feed embeddings up to date")
+		return nil
+	}
+
+	for i := 0; i < len(batch); i += embedBatchSize {
+		end := min(i+embedBatchSize, len(batch))
+		sub := batch[i:end]
+
+		texts := make([]string, len(sub))
+		for j, f := range sub {
+			texts[j] = f.text
+		}
+
+		embeddings, err := e.embedder.Embed(ctx, texts)
+		if err != nil {
+			return fmt.Errorf("embed feed batch %d: %w", i/embedBatchSize, err)
+		}
+
+		tx, err := conn.BeginTx(ctx, nil)
+		if err != nil {
+			return err
+		}
+		defer func() { _ = tx.Rollback() }()
+
+		for j, emb := range embeddings {
+			blob, err := vec.SerializeFloat32(emb)
+			if err != nil {
+				return fmt.Errorf("serialize feed embedding: %w", err)
+			}
+			if _, err := tx.ExecContext(ctx,
+				`DELETE FROM recs.feed_embeddings WHERE feed_url = ?`, sub[j].url,
+			); err != nil {
+				return fmt.Errorf("delete feed embedding: %w", err)
+			}
+			if _, err := tx.ExecContext(ctx,
+				`INSERT INTO recs.feed_embeddings(feed_url, embedding) VALUES (?, ?)`,
+				sub[j].url, blob,
+			); err != nil {
+				return fmt.Errorf("insert feed embedding: %w", err)
+			}
+			if _, err := tx.ExecContext(ctx,
+				`INSERT OR REPLACE INTO recs.feed_embedding_meta(feed_url, source_text) VALUES (?, ?)`,
+				sub[j].url, sub[j].text,
+			); err != nil {
+				return fmt.Errorf("upsert feed embedding meta: %w", err)
+			}
+		}
+
+		if err := tx.Commit(); err != nil {
+			return err
+		}
+
+		e.logger.Info("feed embeddings batch computed",
+			slog.Int("batch", i/embedBatchSize),
+			slog.Int("count", len(sub)),
+		)
+	}
+
+	e.logger.Info("feed embeddings computed", slog.Int("total", len(batch)))
+	return nil
+}
+
+func (e *Engine) ensureContentBoostTable(ctx context.Context, conn *sql.Conn) error {
+	_, err := conn.ExecContext(ctx, `CREATE TEMP TABLE IF NOT EXISTS _content_boost (article_id INT PRIMARY KEY, score REAL)`)
+	if err != nil {
+		return err
+	}
+	_, err = conn.ExecContext(ctx, `DELETE FROM _content_boost`)
+	return err
+}
+
+func joinPh(ph []string) string {
+	var s strings.Builder
+	for i, p := range ph {
+		if i > 0 {
+			s.WriteString(",")
+		}
+		s.WriteString(p)
+	}
+	return s.String()
+}
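The interest-vector step in `populateContentBoost` (average the liked articles' embeddings, then rank candidates by similarity) can be illustrated in isolation. This is a sketch in plain Go without SQLite; `meanVector` and `cosineSimilarity` are hypothetical helper names, not functions from the repository.

```go
package main

import (
	"fmt"
	"math"
)

// meanVector averages vectors of length dim, skipping any with a different
// length (as populateContentBoost does). It returns nil if none qualify.
func meanVector(vecs [][]float32, dim int) []float32 {
	sum := make([]float32, dim)
	count := 0
	for _, v := range vecs {
		if len(v) != dim {
			continue
		}
		for i := range sum {
			sum[i] += v[i]
		}
		count++
	}
	if count == 0 {
		return nil
	}
	for i := range sum {
		sum[i] /= float32(count)
	}
	return sum
}

// cosineSimilarity returns a·b / (|a||b|), or 0 for zero-norm or
// mismatched-length vectors.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	liked := [][]float32{{1, 0}, {0, 1}}
	interest := meanVector(liked, 2)
	fmt.Println(interest, cosineSimilarity(interest, []float32{1, 1}))
}
```

One point worth verifying against the sqlite-vec documentation: `score := 1.0 - dist` equals cosine similarity only if the vec0 column reports cosine distance. The `CREATE VIRTUAL TABLE` statements in the spec declare `embedding float[1536]` with no distance metric, and vec0 is understood to default to L2 distance unless the column carries `distance_metric=cosine`. The table-creation code is not shown in this diff, so this may already be handled there.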
+15 -3
internal/cluster/cron.go
···
 	"pkg.rbrt.fr/glean/internal/metrics"
 )

+// Cron periodically runs all cluster engine computations (similarity, embeddings,
+// follow distances, signal profiles, auto-dismiss) on a fixed interval.
 type Cron struct {
 	engine   *Engine
 	interval time.Duration
 	logger   *slog.Logger
 }

+// NewCron creates a new cron runner with the given engine and interval.
 func NewCron(engine *Engine, interval time.Duration, logger *slog.Logger) *Cron {
 	return &Cron{engine: engine, interval: interval, logger: logger}
 }

+// Run starts the cron loop. It blocks until ctx is cancelled. Each tick runs
+// all computations sequentially; if a previous run is still in progress the
+// tick is skipped.
 func (c *Cron) Run(ctx context.Context) error {
 	for {
 		c.logger.Info("starting similarity computation")
···
 		if !c.engine.mu.TryLock() {
 			c.logger.Info("skipping computation: already in progress")
 		} else {
+			if err := c.engine.ComputeFeedEmbeddings(ctx); err != nil {
+				c.engine.logger.Error("feed embeddings failed", "error", err)
+			}
 			if err := c.engine.ComputeFeedSimilarity(ctx); err != nil {
 				c.engine.logger.Error("feed similarity failed", "error", err)
 			}
 			if err := c.engine.ComputeUserSimilarity(ctx); err != nil {
 				c.engine.logger.Error("user similarity failed", "error", err)
 			}
-			if err := c.engine.ComputeFollowDistances(ctx); err != nil {
-				c.engine.logger.Error("follow distances failed", "error", err)
+			if err := c.engine.ComputeArticleEmbeddings(ctx); err != nil {
+				c.engine.logger.Error("article embeddings failed", "error", err)
 			}
 			if err := c.engine.ComputeSignalProfiles(ctx); err != nil {
 				c.engine.logger.Error("signal profiles failed", "error", err)
 			}
-			if err := c.engine.AutoDismissStale(ctx, 15, 30); err != nil {
+			if err := c.engine.ComputeFollowDistances(ctx); err != nil {
+				c.logger.Error("follow distances failed", "error", err)
+			}
+			if err := c.engine.AutoDismissStale(ctx, 5, 5); err != nil {
 				c.engine.logger.Error("auto dismiss failed", "error", err)
 			}
 			c.engine.mu.Unlock()
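The skip-if-busy behaviour documented on `Run` relies on `sync.Mutex.TryLock` (available since Go 1.18). A minimal sketch of that pattern on its own, using a hypothetical `runner` type rather than the repository's `Engine`:

```go
package main

import (
	"fmt"
	"sync"
)

// runner guards a batch job with a mutex so overlapping ticks are skipped
// rather than queued.
type runner struct {
	mu sync.Mutex
}

// tryRun executes job if no other run holds the lock and reports whether it ran.
func (r *runner) tryRun(job func()) bool {
	if !r.mu.TryLock() {
		return false
	}
	defer r.mu.Unlock()
	job()
	return true
}

func main() {
	var r runner
	ran := r.tryRun(func() {
		// A tick arriving while the job is still running is skipped.
		fmt.Println("nested tick ran:", r.tryRun(func() {}))
	})
	fmt.Println("outer tick ran:", ran)
}
```

Skipping instead of blocking means a slow embedding batch delays the next computation by at most one interval and never stacks up pending runs.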
+11 -11
internal/cluster/dismiss.go
···
 	"time"
 )
 
+// Impression records that a recommendation was shown to a user.
 type Impression struct {
 	TargetType string
 	TargetID   string
 }
 
 func (e *Engine) DismissFeed(ctx context.Context, userDID, feedURL, reason string) error {
-	return nil
 	_, err := e.db.ExecContext(ctx, `
-		INSERT INTO recs.dismissed_recommendations (user_did, target_type, target_id, reason)
+		INSERT INTO main.dismissed_recommendations (user_did, target_type, target_id, reason)
 		VALUES (?, 'feed', ?, ?)
 		ON CONFLICT(user_did, target_type, target_id) DO UPDATE SET reason = excluded.reason, dismissed_at = CURRENT_TIMESTAMP
 	`, userDID, feedURL, reason)
···
 }
 
 func (e *Engine) DismissArticle(ctx context.Context, userDID, articleURL, reason string) error {
-	return nil
 	_, err := e.db.ExecContext(ctx, `
-		INSERT INTO recs.dismissed_recommendations (user_did, target_type, target_id, reason)
+		INSERT INTO main.dismissed_recommendations (user_did, target_type, target_id, reason)
 		VALUES (?, 'article', ?, ?)
 		ON CONFLICT(user_did, target_type, target_id) DO UPDATE SET reason = excluded.reason, dismissed_at = CURRENT_TIMESTAMP
 	`, userDID, articleURL, reason)
···
 }
 
 func (e *Engine) RecordImpressions(ctx context.Context, userDID string, impressions []Impression) error {
-	return nil
 	tx, err := e.db.BeginTx(ctx, nil)
 	if err != nil {
 		return err
···
 
 	for _, imp := range impressions {
 		_, err := tx.ExecContext(ctx, `
-			INSERT INTO recs.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count)
+			INSERT INTO main.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count)
 			VALUES (?, ?, ?, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1)
 			ON CONFLICT(user_did, target_type, target_id) DO UPDATE SET
 				last_shown_at = CURRENT_TIMESTAMP,
···
 }
 
 func (e *Engine) MarkImpressionActed(ctx context.Context, userDID, targetType, targetID string) error {
-	return nil
 	_, err := e.db.ExecContext(ctx, `
-		UPDATE recs.recommendation_impressions SET acted = 1
+		UPDATE main.recommendation_impressions SET acted = 1
 		WHERE user_did = ? AND target_type = ? AND target_id = ?
 	`, userDID, targetType, targetID)
 	return err
 }
 
+// AutoDismissStale marks recommendations as dismissed if they were shown at
+// least minShownCount times over more than maxAgeDays without the user acting
+// on them.
 func (e *Engine) AutoDismissStale(ctx context.Context, minShownCount int, maxAgeDays int) error {
 	cutoff := time.Now().AddDate(0, 0, -maxAgeDays).Format(time.RFC3339)
 
 	_, err := e.db.ExecContext(ctx, `
-		INSERT OR IGNORE INTO recs.dismissed_recommendations (user_did, target_type, target_id, reason, dismissed_at)
+		INSERT OR IGNORE INTO main.dismissed_recommendations (user_did, target_type, target_id, reason, dismissed_at)
 		SELECT user_did, target_type, target_id, 'auto_stale', CURRENT_TIMESTAMP
-		FROM recs.recommendation_impressions
+		FROM main.recommendation_impressions
 		WHERE acted = 0
 		AND shown_count >= ?
 		AND first_shown_at < ?
···
 func (e *Engine) IsFeedDismissed(ctx context.Context, userDID, feedURL string) (bool, error) {
 	var count int
 	err := e.db.QueryRowContext(ctx, `
-		SELECT COUNT(1) FROM recs.dismissed_recommendations
+		SELECT COUNT(1) FROM main.dismissed_recommendations
 		WHERE user_did = ? AND target_type = 'feed' AND target_id = ?
 	`, userDID, feedURL).Scan(&count)
 	return count > 0, err
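One detail in `AutoDismissStale` may be worth double-checking: `first_shown_at` is written by SQLite's `CURRENT_TIMESTAMP`, which produces UTC text of the form `YYYY-MM-DD HH:MM:SS`, while the cutoff is formatted with `time.RFC3339` from local time. The sketch below assumes the column holds that text and that `first_shown_at < ?` is a plain text comparison; under those assumptions the two formats only disagree for rows that fall on the cutoff day itself. `sqliteCutoff`, `olderThan`, and `exampleCutoff` are hypothetical helpers for illustration and are not part of this change.

```go
package main

import (
	"fmt"
	"time"
)

// sqliteTimestampLayout matches the text SQLite's CURRENT_TIMESTAMP produces.
const sqliteTimestampLayout = "2006-01-02 15:04:05"

// sqliteCutoff is a hypothetical helper: it formats the cutoff in UTC using
// the same layout as CURRENT_TIMESTAMP, so text comparison orders correctly.
func sqliteCutoff(now time.Time, maxAgeDays int) string {
	return now.UTC().AddDate(0, 0, -maxAgeDays).Format(sqliteTimestampLayout)
}

// olderThan mimics a text comparison such as `first_shown_at < ?`.
func olderThan(firstShownAt, cutoff string) bool {
	return firstShownAt < cutoff
}

// exampleCutoff evaluates sqliteCutoff at a fixed instant so the result is
// reproducible.
func exampleCutoff() string {
	now := time.Date(2025, time.March, 10, 12, 0, 0, 0, time.UTC)
	return sqliteCutoff(now, 7)
}

func main() {
	// This impression was first shown six hours after the cutoff instant.
	shown := "2025-03-03 18:00:00"

	fmt.Println("cutoff:", exampleCutoff())
	// RFC 3339 cutoff: ' ' sorts before 'T', so the row looks older than it is.
	fmt.Println("stale vs RFC 3339 cutoff:", olderThan(shown, "2025-03-03T12:00:00Z"))
	// Matching layout: the comparison follows the actual timestamps.
	fmt.Println("stale vs SQLite-layout cutoff:", olderThan(shown, exampleCutoff()))
}
```

If the assumption holds, the practical effect is small (at most one day of drift plus the local-time offset), so this is a precision note rather than a blocker.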
+3
internal/cluster/diversity.go
···
 const maxPerDomain = 2
 const maxPerCategory = 3
 
+// ApplyDiversity filters candidates to limit how many feeds come from the same
+// domain (max 2) or category (max 3). Candidates are assumed to be sorted by
+// score descending.
 func ApplyDiversity(candidates []*FeedRecommendation, topN int) []*FeedRecommendation {
 	domainCount := make(map[string]int, len(candidates))
 	categoryCount := make(map[string]int)
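The rest of `ApplyDiversity` is not visible in this hunk. As a rough illustration of the capping rule the new comment describes, here is a minimal sketch, assuming a simplified `candidate` type in place of `FeedRecommendation` and assuming the domain is taken from the feed URL's host. Both are assumptions for illustration; the real function may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

const (
	maxPerDomain   = 2
	maxPerCategory = 3
)

// candidate is a hypothetical stand-in for FeedRecommendation.
type candidate struct {
	FeedURL  string
	Category string
	Score    float64
}

// domainOf returns the host part of a feed URL, falling back to the raw
// string when the URL cannot be parsed.
func domainOf(feedURL string) string {
	u, err := url.Parse(feedURL)
	if err != nil || u.Hostname() == "" {
		return feedURL
	}
	return u.Hostname()
}

// applyDiversity walks candidates (assumed sorted by score descending) and
// keeps at most maxPerDomain per domain and maxPerCategory per category,
// stopping once topN items are kept.
func applyDiversity(candidates []candidate, topN int) []candidate {
	if topN <= 0 {
		return nil
	}
	domainCount := make(map[string]int, len(candidates))
	categoryCount := make(map[string]int)
	kept := make([]candidate, 0, topN)
	for _, c := range candidates {
		if len(kept) >= topN {
			break
		}
		domain := domainOf(c.FeedURL)
		if domainCount[domain] >= maxPerDomain {
			continue
		}
		if c.Category != "" && categoryCount[c.Category] >= maxPerCategory {
			continue
		}
		domainCount[domain]++
		if c.Category != "" {
			categoryCount[c.Category]++
		}
		kept = append(kept, c)
	}
	return kept
}

// exampleDiversity runs the filter on a small fixed input and returns the
// URLs that survive.
func exampleDiversity() []string {
	in := []candidate{
		{"https://a.com/feed1", "tech", 0.9},
		{"https://a.com/feed2", "tech", 0.8},
		{"https://a.com/feed3", "tech", 0.7}, // third from a.com: dropped
		{"https://b.com/feed", "tech", 0.6},
		{"https://c.com/feed", "tech", 0.5}, // fourth "tech" feed: dropped
		{"https://d.com/feed", "news", 0.4},
	}
	var urls []string
	for _, c := range applyDiversity(in, 5) {
		urls = append(urls, c.FeedURL)
	}
	return urls
}

func main() {
	fmt.Println(exampleDiversity())
}
```

Because the input is processed in score order, the caps always drop the lowest-scoring duplicates, which is why the comment's sorted-input assumption matters.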
+80
internal/cluster/embed.go
···
+package cluster
+
+import (
+	"bytes"
+	"context"
+	"encoding/binary"
+
+	"github.com/openai/openai-go"
+	"github.com/openai/openai-go/option"
+)
+
+// Embedder generates vector embeddings for text inputs. Implementations must be
+// safe for concurrent use.
+type Embedder interface {
+	Embed(ctx context.Context, texts []string) ([][]float32, error)
+	Dimension() int
+}
+
+type OpenAIEmbedder struct {
+	client    openai.Client
+	model     string
+	dimension int
+}
+
+type OpenAIEmbedderConfig struct {
+	BaseURL   string
+	APIKey    string
+	Model     string
+	Dimension int
+}
+
+func NewOpenAIEmbedder(cfg OpenAIEmbedderConfig) *OpenAIEmbedder {
+	opts := []option.RequestOption{}
+	if cfg.BaseURL != "" {
+		opts = append(opts, option.WithBaseURL(cfg.BaseURL))
+	}
+	if cfg.APIKey != "" {
+		opts = append(opts, option.WithAPIKey(cfg.APIKey))
+	}
+	return &OpenAIEmbedder{
+		client:    openai.NewClient(opts...),
+		model:     cfg.Model,
+		dimension: cfg.Dimension,
+	}
+}
+
+func (e *OpenAIEmbedder) Dimension() int {
+	return e.dimension
+}
+
+func (e *OpenAIEmbedder) Embed(ctx context.Context, texts []string) ([][]float32, error) {
+	resp, err := e.client.Embeddings.New(ctx, openai.EmbeddingNewParams{
+		Model: e.model,
+		Input: openai.EmbeddingNewParamsInputUnion{
+			OfArrayOfStrings: texts,
+		},
+	})
+	if err != nil {
+		return nil, err
+	}
+
+	embeddings := make([][]float32, len(resp.Data))
+	for i, d := range resp.Data {
+		embeddings[i] = make([]float32, len(d.Embedding))
+		for j, v := range d.Embedding {
+			embeddings[i][j] = float32(v)
+		}
+	}
+	return embeddings, nil
+}
+
+func deserializeFloat32(data []byte) []float32 {
+	if len(data)%4 != 0 {
+		return nil
+	}
+	result := make([]float32, len(data)/4)
+	r := bytes.NewReader(data)
+	_ = binary.Read(r, binary.LittleEndian, &result)
+	return result
+}
+364 -284
internal/cluster/jaccard.go
··· 5 5 "database/sql" 6 6 "fmt" 7 7 "log/slog" 8 + "math" 8 9 "sync" 9 10 ) 10 11 12 + // Config controls weights used during similarity computation (feed similarity 13 + // embedding boost, user similarity like/tag/follow contributions). These are 14 + // distinct from SignalWeights which control on-demand scoring multipliers. 11 15 type Config struct { 12 16 FollowBoost float64 13 17 LikesWeight float64 ··· 15 19 DescriptionWeight float64 16 20 } 17 21 22 + // DefaultConfig returns the default similarity computation weights. 18 23 func DefaultConfig() Config { 19 24 return Config{ 20 25 FollowBoost: 0.5, ··· 24 29 } 25 30 } 26 31 27 - // Engine uses *sql.DB directly because it performs cross-schema transactions 28 - // across main, articles, and recs. Typed stores would add overhead without benefit here. 32 + // Engine is the recommendation engine. It holds a reference to the SQLite 33 + // database, an optional embedder for content-based signals, and configuration. 34 + // All public methods are safe for concurrent use (cron writes, on-demand reads). 29 35 type Engine struct { 30 - db *sql.DB 31 - logger *slog.Logger 32 - mu sync.Mutex 33 - config Config 36 + db *sql.DB 37 + logger *slog.Logger 38 + mu sync.Mutex 39 + config Config 40 + embedder Embedder 34 41 } 35 42 36 - func NewEngine(db *sql.DB, logger *slog.Logger) *Engine { 37 - return &Engine{db: db, logger: logger, config: DefaultConfig()} 43 + // NewEngine creates a new recommendation engine. Pass nil for embedder to 44 + // disable content-based signals (no embedding computation, no KNN queries). 
45 + func NewEngine(db *sql.DB, embedder Embedder, logger *slog.Logger) *Engine { 46 + return &Engine{db: db, logger: logger, config: DefaultConfig(), embedder: embedder} 38 47 } 39 48 49 + // ComputeFeedSimilarity recomputes the feed_similarity table: time-decayed 50 + // subscriber Jaccard for all feed pairs with shared subscribers, plus an 51 + // embedding cosine similarity boost for pairs that both have embeddings. 40 52 func (e *Engine) ComputeFeedSimilarity(ctx context.Context) error { 41 - tx, err := e.db.BeginTx(ctx, nil) 53 + conn, err := e.db.Conn(ctx) 42 54 if err != nil { 43 55 return err 44 56 } 45 - defer func() { _ = tx.Rollback() }() 57 + defer conn.Close() 46 58 47 - if _, err := tx.ExecContext(ctx, `DELETE FROM recs.feed_similarity`); err != nil { 48 - return err 49 - } 59 + { 60 + tx, err := conn.BeginTx(ctx, nil) 61 + if err != nil { 62 + return err 63 + } 64 + defer func() { _ = tx.Rollback() }() 50 65 51 - _, err = tx.ExecContext(ctx, ` 52 - INSERT INTO recs.feed_similarity (feed_a, feed_b, jaccard) 53 - SELECT 54 - s1.feed_url, 55 - s2.feed_url, 56 - CAST(COUNT(*) AS REAL) / (f1.subscriber_count + f2.subscriber_count - CAST(COUNT(*) AS REAL)) 57 - FROM articles.subscriptions s1 58 - JOIN articles.subscriptions s2 ON s1.user_did = s2.user_did AND s1.feed_url < s2.feed_url 59 - JOIN articles.feeds f1 ON f1.feed_url = s1.feed_url 60 - JOIN articles.feeds f2 ON f2.feed_url = s2.feed_url 61 - GROUP BY s1.feed_url, s2.feed_url 62 - `) 63 - if err != nil { 64 - return err 65 - } 66 + if _, err := tx.ExecContext(ctx, `CREATE TEMP TABLE IF NOT EXISTS _feed_sim_staging ( 67 + feed_a TEXT NOT NULL, feed_b TEXT NOT NULL, jaccard REAL NOT NULL, 68 + PRIMARY KEY (feed_a, feed_b))`); err != nil { 69 + return err 70 + } 71 + if _, err := tx.ExecContext(ctx, `DELETE FROM _feed_sim_staging`); err != nil { 72 + return err 73 + } 66 74 67 - if err := e.computeDescriptionSimilarity(ctx, tx); err != nil { 68 - e.logger.Warn("description similarity failed", 
"error", err) 75 + _, err = tx.ExecContext(ctx, ` 76 + INSERT INTO _feed_sim_staging (feed_a, feed_b, jaccard) 77 + SELECT 78 + s1.feed_url, 79 + s2.feed_url, 80 + SUM(EXP(-0.023 * CAST(julianday('now') - julianday(MIN(s1.added_at, s2.added_at)) AS REAL))) 81 + / (f1.subscriber_count + f2.subscriber_count - CAST(COUNT(*) AS REAL)) 82 + FROM articles.subscriptions s1 83 + JOIN articles.subscriptions s2 ON s1.user_did = s2.user_did AND s1.feed_url < s2.feed_url 84 + JOIN articles.feeds f1 ON f1.feed_url = s1.feed_url 85 + JOIN articles.feeds f2 ON f2.feed_url = s2.feed_url 86 + WHERE s1.added_at IS NOT NULL AND s2.added_at IS NOT NULL 87 + GROUP BY s1.feed_url, s2.feed_url 88 + `) 89 + if err != nil { 90 + return err 91 + } 92 + 93 + if err := e.computeEmbeddingSimilarity(ctx, tx); err != nil { 94 + e.logger.Warn("embedding similarity failed", "error", err) 95 + } 96 + 97 + if err := tx.Commit(); err != nil { 98 + return err 99 + } 69 100 } 70 101 71 - e.logger.Info("feed similarity computed") 72 - return tx.Commit() 73 - } 102 + { 103 + tx, err := conn.BeginTx(ctx, nil) 104 + if err != nil { 105 + return err 106 + } 107 + defer func() { _ = tx.Rollback() }() 108 + 109 + if _, err := tx.ExecContext(ctx, `DELETE FROM recs.feed_similarity`); err != nil { 110 + return err 111 + } 112 + if _, err := tx.ExecContext(ctx, `INSERT INTO recs.feed_similarity (feed_a, feed_b, jaccard) SELECT feed_a, feed_b, jaccard FROM _feed_sim_staging`); err != nil { 113 + return err 114 + } 74 115 75 - func (e *Engine) computeDescriptionSimilarity(ctx context.Context, tx *sql.Tx) error { 76 - if _, err := tx.ExecContext(ctx, `CREATE TEMP TABLE IF NOT EXISTS _feed_words (feed_url TEXT, word TEXT)`); err != nil { 77 - return err 116 + e.logger.Info("feed similarity computed") 117 + return tx.Commit() 78 118 } 79 - if _, err := tx.ExecContext(ctx, `DELETE FROM _feed_words`); err != nil { 80 - return err 119 + } 120 + 121 + func (e *Engine) computeEmbeddingSimilarity(ctx context.Context, tx 
*sql.Tx) error { 122 + if e.embedder == nil { 123 + return nil 81 124 } 125 + e.logger.Debug("computing embedding similarity") 82 126 83 - _, err := tx.ExecContext(ctx, ` 84 - INSERT INTO _feed_words (feed_url, word) 85 - WITH feed_tokens AS ( 86 - SELECT feed_url, LOWER(TRIM(value)) AS word 87 - FROM articles.feeds, 88 - json_each('["' || REPLACE(LOWER(COALESCE(description, '')), ' ', '","') || '"]') 89 - WHERE description IS NOT NULL AND description != '' 90 - ) 91 - SELECT feed_url, word FROM feed_tokens 92 - WHERE LENGTH(word) > 3 93 - AND word NOT IN ('about','also','been','being','both','could','every','from','have','here', 94 - 'into','just','like','more','much','must','other','over','some','such','than','that', 95 - 'their','them','then','there','these','they','this','through','very','what','when', 96 - 'where','which','while','will','with','your','most','updated','latest','posts', 97 - 'news','blog','feed','reading','read','articles','article','weekly','daily', 98 - 'monthly','personal','thoughts','views','opinions','writing','write','written') 99 - `) 127 + stagingRows, err := tx.QueryContext(ctx, `SELECT feed_a, feed_b FROM _feed_sim_staging`) 100 128 if err != nil { 101 129 return err 102 130 } 103 131 104 - if _, err := tx.ExecContext(ctx, ` 105 - CREATE TEMP TABLE IF NOT EXISTS _feed_word_counts (feed_url TEXT PRIMARY KEY, cnt INT) 106 - `); err != nil { 107 - return err 108 - } 109 - if _, err := tx.ExecContext(ctx, `DELETE FROM _feed_word_counts`); err != nil { 110 - return err 111 - } 112 - if _, err := tx.ExecContext(ctx, ` 113 - INSERT INTO _feed_word_counts (feed_url, cnt) 114 - SELECT feed_url, COUNT(DISTINCT word) FROM _feed_words GROUP BY feed_url 115 - `); err != nil { 116 - return err 132 + type pair struct{ a, b string } 133 + var pairs []pair 134 + feedSet := make(map[string]bool) 135 + for stagingRows.Next() { 136 + var p pair 137 + if err := stagingRows.Scan(&p.a, &p.b); err != nil { 138 + stagingRows.Close() 139 + return err 140 + } 
141 + pairs = append(pairs, p) 142 + feedSet[p.a] = true 143 + feedSet[p.b] = true 117 144 } 145 + stagingRows.Close() 118 146 119 - if _, err := tx.ExecContext(ctx, ` 120 - CREATE TEMP TABLE IF NOT EXISTS _word_overlap (feed_a TEXT, feed_b TEXT, common INT) 121 - `); err != nil { 122 - return err 147 + if len(feedSet) == 0 { 148 + return nil 123 149 } 124 - if _, err := tx.ExecContext(ctx, `DELETE FROM _word_overlap`); err != nil { 125 - return err 150 + 151 + ph := make([]string, 0, len(feedSet)) 152 + args := make([]any, 0, len(feedSet)) 153 + for url := range feedSet { 154 + ph = append(ph, "?") 155 + args = append(args, url) 126 156 } 127 - if _, err := tx.ExecContext(ctx, ` 128 - INSERT INTO _word_overlap (feed_a, feed_b, common) 129 - SELECT w1.feed_url, w2.feed_url, COUNT(DISTINCT w1.word) 130 - FROM _feed_words w1 131 - JOIN _feed_words w2 ON w1.word = w2.word AND w1.feed_url < w2.feed_url 132 - GROUP BY w1.feed_url, w2.feed_url 133 - HAVING COUNT(DISTINCT w1.word) > 1 134 - `); err != nil { 157 + embRows, err := tx.QueryContext(ctx, 158 + fmt.Sprintf("SELECT feed_url, embedding FROM recs.feed_embeddings WHERE feed_url IN (%s)", joinPh(ph)), 159 + args..., 160 + ) 161 + if err != nil { 135 162 return err 136 163 } 137 164 138 - if _, err := tx.ExecContext(ctx, ` 139 - CREATE INDEX IF NOT EXISTS _idx_feed_words_word ON _feed_words(word, feed_url) 140 - `); err != nil { 141 - return err 165 + embeddings := make(map[string][]float32) 166 + for embRows.Next() { 167 + var url string 168 + var blob []byte 169 + if err := embRows.Scan(&url, &blob); err != nil { 170 + embRows.Close() 171 + return err 172 + } 173 + v := deserializeFloat32(blob) 174 + if len(v) > 0 { 175 + embeddings[url] = v 176 + } 142 177 } 143 - 144 - descInsert := ` 145 - INSERT OR IGNORE INTO recs.feed_similarity (feed_a, feed_b, jaccard) 146 - SELECT feed_a, feed_b, 0 FROM _word_overlap 147 - ` 148 - if _, err := tx.ExecContext(ctx, descInsert); err != nil { 149 - return err 150 - } 151 - 152 
- descUpdate := fmt.Sprintf(` 153 - UPDATE recs.feed_similarity SET 154 - jaccard = jaccard + %g * CAST(_word_overlap.common AS REAL) / NULLIF( 155 - (SELECT cnt FROM _feed_word_counts WHERE feed_url = recs.feed_similarity.feed_a) + 156 - (SELECT cnt FROM _feed_word_counts WHERE feed_url = recs.feed_similarity.feed_b) - 157 - CAST(_word_overlap.common AS REAL), 158 - 0 159 - ) 160 - FROM _word_overlap 161 - WHERE recs.feed_similarity.feed_a = _word_overlap.feed_a 162 - AND recs.feed_similarity.feed_b = _word_overlap.feed_b 163 - `, e.config.DescriptionWeight) 178 + embRows.Close() 164 179 165 - if _, err := tx.ExecContext(ctx, descUpdate); err != nil { 166 - return err 180 + for _, p := range pairs { 181 + vecA, okA := embeddings[p.a] 182 + vecB, okB := embeddings[p.b] 183 + if !okA || !okB { 184 + continue 185 + } 186 + sim := cosineSimilarity(vecA, vecB) 187 + if sim <= 0 { 188 + continue 189 + } 190 + boost := sim * e.config.DescriptionWeight 191 + if _, err := tx.ExecContext(ctx, 192 + `UPDATE _feed_sim_staging SET jaccard = jaccard + ? WHERE feed_a = ? 
AND feed_b = ?`, 193 + boost, p.a, p.b, 194 + ); err != nil { 195 + return err 196 + } 167 197 } 168 198 169 199 return nil 170 200 } 171 201 172 - func (e *Engine) ComputeUserSimilarity(ctx context.Context) error { 173 - tx, err := e.db.BeginTx(ctx, nil) 174 - if err != nil { 175 - return err 202 + func cosineSimilarity(a, b []float32) float64 { 203 + var dot, normA, normB float64 204 + for i := range a { 205 + dot += float64(a[i]) * float64(b[i]) 206 + normA += float64(a[i]) * float64(a[i]) 207 + normB += float64(b[i]) * float64(b[i]) 176 208 } 177 - defer func() { _ = tx.Rollback() }() 178 - 179 - if _, err := tx.ExecContext(ctx, `DELETE FROM recs.user_similarity`); err != nil { 180 - return err 209 + if normA == 0 || normB == 0 { 210 + return 0 181 211 } 212 + return dot / (math.Sqrt(normA) * math.Sqrt(normB)) 213 + } 182 214 183 - _, err = tx.ExecContext(ctx, ` 184 - INSERT INTO recs.user_similarity (user_a, user_b, jaccard, common_feeds) 185 - SELECT 186 - s1.user_did, 187 - s2.user_did, 188 - CAST(COUNT(*) AS REAL) / ( 189 - (SELECT COUNT(*) FROM articles.subscriptions WHERE user_did = s1.user_did) + 190 - (SELECT COUNT(*) FROM articles.subscriptions WHERE user_did = s2.user_did) - 191 - CAST(COUNT(*) AS REAL) 192 - ), 193 - COUNT(*) 194 - FROM articles.subscriptions s1 195 - JOIN articles.subscriptions s2 ON s1.feed_url = s2.feed_url AND s1.user_did < s2.user_did 196 - GROUP BY s1.user_did, s2.user_did 197 - `) 215 + // ComputeUserSimilarity recomputes the user_similarity table: subscription 216 + // Jaccard + time-decayed like co-occurrence + tag overlap + follow boost. 
217 + func (e *Engine) ComputeUserSimilarity(ctx context.Context) error { 218 + conn, err := e.db.Conn(ctx) 198 219 if err != nil { 199 220 return err 200 221 } 222 + defer conn.Close() 201 223 202 - if _, err := tx.ExecContext(ctx, ` 203 - CREATE TEMP TABLE IF NOT EXISTS _likes_count (author_did TEXT PRIMARY KEY, cnt INT) 204 - `); err != nil { 205 - return err 206 - } 207 - if _, err := tx.ExecContext(ctx, `DELETE FROM _likes_count`); err != nil { 208 - return err 209 - } 210 - if _, err := tx.ExecContext(ctx, ` 211 - INSERT INTO _likes_count (author_did, cnt) 212 - SELECT author_did, COUNT(*) FROM articles.likes GROUP BY author_did 213 - `); err != nil { 214 - return err 215 - } 224 + { 225 + tx, err := conn.BeginTx(ctx, nil) 226 + if err != nil { 227 + return err 228 + } 229 + defer func() { _ = tx.Rollback() }() 216 230 217 - if _, err := tx.ExecContext(ctx, ` 218 - CREATE TEMP TABLE IF NOT EXISTS _likes_overlap (user_a TEXT, user_b TEXT, common INT, PRIMARY KEY(user_a, user_b)) 219 - `); err != nil { 220 - return err 221 - } 222 - if _, err := tx.ExecContext(ctx, `DELETE FROM _likes_overlap`); err != nil { 223 - return err 224 - } 225 - if _, err := tx.ExecContext(ctx, ` 226 - INSERT INTO _likes_overlap (user_a, user_b, common) 227 - SELECT l1.author_did, l2.author_did, 228 - CAST(SUM( 229 - EXP(-0.023 * CAST(julianday('now') - julianday(l1.created_at) AS REAL)) 230 - * EXP(-0.023 * CAST(julianday('now') - julianday(l2.created_at) AS REAL)) 231 - ) AS INTEGER) 232 - FROM articles.likes l1 233 - JOIN articles.likes l2 ON l1.feed_url = l2.feed_url AND l1.article_url = l2.article_url 234 - AND l1.author_did < l2.author_did 235 - WHERE l1.created_at IS NOT NULL AND l2.created_at IS NOT NULL 236 - GROUP BY l1.author_did, l2.author_did 237 - `); err != nil { 238 - return err 239 - } 231 + if _, err := tx.ExecContext(ctx, `CREATE TEMP TABLE IF NOT EXISTS _user_sim_staging ( 232 + user_a TEXT NOT NULL, user_b TEXT NOT NULL, jaccard REAL NOT NULL, 233 + common_feeds 
INT NOT NULL DEFAULT 0, common_likes INT NOT NULL DEFAULT 0, common_tags INT NOT NULL DEFAULT 0, 234 + PRIMARY KEY (user_a, user_b))`); err != nil { 235 + return err 236 + } 237 + if _, err := tx.ExecContext(ctx, `DELETE FROM _user_sim_staging`); err != nil { 238 + return err 239 + } 240 240 241 - likesUpdate := fmt.Sprintf(` 242 - UPDATE recs.user_similarity SET 243 - jaccard = jaccard + %g * CAST(_likes_overlap.common AS REAL) / NULLIF( 244 - (SELECT cnt FROM _likes_count WHERE author_did = recs.user_similarity.user_a) + 245 - (SELECT cnt FROM _likes_count WHERE author_did = recs.user_similarity.user_b) - 246 - CAST(_likes_overlap.common AS REAL), 241 + _, err = tx.ExecContext(ctx, ` 242 + INSERT INTO _user_sim_staging (user_a, user_b, jaccard, common_feeds, common_likes, common_tags) 243 + SELECT 244 + s1.user_did, 245 + s2.user_did, 246 + CAST(COUNT(*) AS REAL) / ( 247 + (SELECT COUNT(*) FROM articles.subscriptions WHERE user_did = s1.user_did) + 248 + (SELECT COUNT(*) FROM articles.subscriptions WHERE user_did = s2.user_did) - 249 + CAST(COUNT(*) AS REAL) 250 + ), 251 + COUNT(*), 252 + 0, 247 253 0 248 - ), 249 - common_likes = _likes_overlap.common 250 - FROM _likes_overlap 251 - WHERE recs.user_similarity.user_a = _likes_overlap.user_a 252 - AND recs.user_similarity.user_b = _likes_overlap.user_b 253 - `, e.config.LikesWeight) 254 + FROM articles.subscriptions s1 255 + JOIN articles.subscriptions s2 ON s1.feed_url = s2.feed_url AND s1.user_did < s2.user_did 256 + GROUP BY s1.user_did, s2.user_did 257 + `) 258 + if err != nil { 259 + return err 260 + } 254 261 255 - if _, err := tx.ExecContext(ctx, likesUpdate); err != nil { 256 - return err 257 - } 262 + if _, err := tx.ExecContext(ctx, ` 263 + CREATE TEMP TABLE IF NOT EXISTS _likes_count (author_did TEXT PRIMARY KEY, cnt INT) 264 + `); err != nil { 265 + return err 266 + } 267 + if _, err := tx.ExecContext(ctx, `DELETE FROM _likes_count`); err != nil { 268 + return err 269 + } 270 + if _, err := 
tx.ExecContext(ctx, ` 271 + INSERT INTO _likes_count (author_did, cnt) 272 + SELECT author_did, COUNT(*) FROM articles.likes GROUP BY author_did 273 + `); err != nil { 274 + return err 275 + } 258 276 259 - likesInsert := fmt.Sprintf(` 260 - INSERT INTO recs.user_similarity (user_a, user_b, jaccard, common_feeds, common_likes) 261 - SELECT sub.user_a, sub.user_b, sub.jaccard, 0, sub.common 262 - FROM ( 263 - SELECT 264 - lo.user_a, 265 - lo.user_b, 266 - %g * CAST(lo.common AS REAL) / NULLIF( 267 - (SELECT cnt FROM _likes_count WHERE author_did = lo.user_a) + 268 - (SELECT cnt FROM _likes_count WHERE author_did = lo.user_b) - 269 - CAST(lo.common AS REAL), 277 + if _, err := tx.ExecContext(ctx, ` 278 + CREATE TEMP TABLE IF NOT EXISTS _likes_overlap (user_a TEXT, user_b TEXT, common INT, PRIMARY KEY(user_a, user_b)) 279 + `); err != nil { 280 + return err 281 + } 282 + if _, err := tx.ExecContext(ctx, `DELETE FROM _likes_overlap`); err != nil { 283 + return err 284 + } 285 + if _, err := tx.ExecContext(ctx, ` 286 + INSERT INTO _likes_overlap (user_a, user_b, common) 287 + SELECT l1.author_did, l2.author_did, 288 + CAST(SUM( 289 + EXP(-0.023 * CAST(julianday('now') - julianday(l1.created_at) AS REAL)) 290 + * EXP(-0.023 * CAST(julianday('now') - julianday(l2.created_at) AS REAL)) 291 + ) AS INTEGER) 292 + FROM articles.likes l1 293 + JOIN articles.likes l2 ON l1.feed_url = l2.feed_url AND l1.article_url = l2.article_url 294 + AND l1.author_did < l2.author_did 295 + WHERE l1.created_at IS NOT NULL AND l2.created_at IS NOT NULL 296 + GROUP BY l1.author_did, l2.author_did 297 + `); err != nil { 298 + return err 299 + } 300 + 301 + likesUpdate := fmt.Sprintf(` 302 + UPDATE _user_sim_staging SET 303 + jaccard = jaccard + %g * CAST(_likes_overlap.common AS REAL) / NULLIF( 304 + (SELECT cnt FROM _likes_count WHERE author_did = _user_sim_staging.user_a) + 305 + (SELECT cnt FROM _likes_count WHERE author_did = _user_sim_staging.user_b) - 306 + CAST(_likes_overlap.common AS 
REAL), 270 307 0 271 - ) AS jaccard, 272 - lo.common 273 - FROM _likes_overlap lo 274 - ) sub WHERE 1 275 - ON CONFLICT(user_a, user_b) DO UPDATE SET 276 - jaccard = jaccard + excluded.jaccard, 277 - common_likes = excluded.common_likes 278 - `, e.config.LikesWeight) 308 + ), 309 + common_likes = _likes_overlap.common 310 + FROM _likes_overlap 311 + WHERE _user_sim_staging.user_a = _likes_overlap.user_a 312 + AND _user_sim_staging.user_b = _likes_overlap.user_b 313 + `, e.config.LikesWeight) 279 314 280 - if _, err := tx.ExecContext(ctx, likesInsert); err != nil { 281 - return err 282 - } 315 + if _, err := tx.ExecContext(ctx, likesUpdate); err != nil { 316 + return err 317 + } 283 318 284 - if _, err := tx.ExecContext(ctx, `CREATE TEMP TABLE IF NOT EXISTS _tag_overlap (user_a TEXT, user_b TEXT, common INT)`); err != nil { 285 - return err 286 - } 287 - if _, err := tx.ExecContext(ctx, `DELETE FROM _tag_overlap`); err != nil { 288 - return err 289 - } 319 + likesInsert := fmt.Sprintf(` 320 + INSERT INTO _user_sim_staging (user_a, user_b, jaccard, common_feeds, common_likes) 321 + SELECT sub.user_a, sub.user_b, sub.jaccard, 0, sub.common 322 + FROM ( 323 + SELECT 324 + lo.user_a, 325 + lo.user_b, 326 + %g * CAST(lo.common AS REAL) / NULLIF( 327 + (SELECT cnt FROM _likes_count WHERE author_did = lo.user_a) + 328 + (SELECT cnt FROM _likes_count WHERE author_did = lo.user_b) - 329 + CAST(lo.common AS REAL), 330 + 0 331 + ) AS jaccard, 332 + lo.common 333 + FROM _likes_overlap lo 334 + ) sub WHERE 1 335 + ON CONFLICT(user_a, user_b) DO UPDATE SET 336 + jaccard = jaccard + excluded.jaccard, 337 + common_likes = excluded.common_likes 338 + `, e.config.LikesWeight) 290 339 291 - _, err = tx.ExecContext(ctx, ` 292 - INSERT INTO _tag_overlap (user_a, user_b, common) 293 - WITH user_tags AS ( 294 - SELECT author_did, TRIM(value) AS tag FROM articles.annotations, json_each('["' || REPLACE(tags, ',', '","') || '"]') 295 - WHERE tags IS NOT NULL AND tags != '' 296 - ) 297 - 
SELECT t1.author_did, t2.author_did, COUNT(DISTINCT t1.tag) 298 - FROM user_tags t1 299 - JOIN user_tags t2 ON t1.tag = t2.tag AND t1.author_did < t2.author_did 300 - GROUP BY t1.author_did, t2.author_did 301 - `) 302 - if err != nil { 303 - return err 304 - } 340 + if _, err := tx.ExecContext(ctx, likesInsert); err != nil { 341 + return err 342 + } 305 343 306 - if _, err := tx.ExecContext(ctx, ` 307 - CREATE TEMP TABLE IF NOT EXISTS _tag_count (author_did TEXT PRIMARY KEY, cnt INT) 308 - `); err != nil { 309 - return err 310 - } 311 - if _, err := tx.ExecContext(ctx, `DELETE FROM _tag_count`); err != nil { 312 - return err 313 - } 314 - if _, err := tx.ExecContext(ctx, ` 315 - INSERT INTO _tag_count (author_did, cnt) 316 - WITH user_tags AS ( 317 - SELECT author_did, TRIM(value) AS tag FROM articles.annotations, json_each('["' || REPLACE(tags, ',', '","') || '"]') 318 - WHERE tags IS NOT NULL AND tags != '' 319 - ) 320 - SELECT author_did, COUNT(DISTINCT tag) FROM user_tags GROUP BY author_did 321 - `); err != nil { 322 - return err 323 - } 344 + if _, err := tx.ExecContext(ctx, `CREATE TEMP TABLE IF NOT EXISTS _tag_overlap (user_a TEXT, user_b TEXT, common INT)`); err != nil { 345 + return err 346 + } 347 + if _, err := tx.ExecContext(ctx, `DELETE FROM _tag_overlap`); err != nil { 348 + return err 349 + } 324 350 325 - _, err = tx.ExecContext(ctx, ` 326 - INSERT OR IGNORE INTO recs.user_similarity (user_a, user_b, jaccard, common_feeds, common_tags) 327 - SELECT user_a, user_b, 0, 0, 0 FROM _tag_overlap 328 - `) 329 - if err != nil { 330 - return err 331 - } 351 + _, err = tx.ExecContext(ctx, ` 352 + INSERT INTO _tag_overlap (user_a, user_b, common) 353 + WITH user_tags AS ( 354 + SELECT author_did, TRIM(value) AS tag FROM articles.annotations, json_each('["' || REPLACE(tags, ',', '","') || '"]') 355 + WHERE tags IS NOT NULL AND tags != '' 356 + ) 357 + SELECT t1.author_did, t2.author_did, COUNT(DISTINCT t1.tag) 358 + FROM user_tags t1 359 + JOIN user_tags t2 ON 
t1.tag = t2.tag AND t1.author_did < t2.author_did 360 + GROUP BY t1.author_did, t2.author_did 361 + `) 362 + if err != nil { 363 + return err 364 + } 332 365 333 - tagsUpdate := fmt.Sprintf(` 334 - UPDATE recs.user_similarity SET 335 - jaccard = jaccard + %g * CAST(_tag_overlap.common AS REAL) / NULLIF( 336 - (SELECT cnt FROM _tag_count WHERE author_did = recs.user_similarity.user_a) + 337 - (SELECT cnt FROM _tag_count WHERE author_did = recs.user_similarity.user_b) - 338 - CAST(_tag_overlap.common AS REAL), 339 - 0 340 - ), 341 - common_tags = _tag_overlap.common 342 - FROM _tag_overlap 343 - WHERE recs.user_similarity.user_a = _tag_overlap.user_a 344 - AND recs.user_similarity.user_b = _tag_overlap.user_b 345 - `, e.config.TagsWeight) 366 + if _, err := tx.ExecContext(ctx, ` 367 + CREATE TEMP TABLE IF NOT EXISTS _tag_count (author_did TEXT PRIMARY KEY, cnt INT) 368 + `); err != nil { 369 + return err 370 + } 371 + if _, err := tx.ExecContext(ctx, `DELETE FROM _tag_count`); err != nil { 372 + return err 373 + } 374 + if _, err := tx.ExecContext(ctx, ` 375 + INSERT INTO _tag_count (author_did, cnt) 376 + WITH user_tags AS ( 377 + SELECT author_did, TRIM(value) AS tag FROM articles.annotations, json_each('["' || REPLACE(tags, ',', '","') || '"]') 378 + WHERE tags IS NOT NULL AND tags != '' 379 + ) 380 + SELECT author_did, COUNT(DISTINCT tag) FROM user_tags GROUP BY author_did 381 + `); err != nil { 382 + return err 383 + } 346 384 347 - if _, err := tx.ExecContext(ctx, tagsUpdate); err != nil { 348 - return err 385 + _, err = tx.ExecContext(ctx, ` 386 + INSERT OR IGNORE INTO _user_sim_staging (user_a, user_b, jaccard, common_feeds, common_tags) 387 + SELECT user_a, user_b, 0, 0, 0 FROM _tag_overlap 388 + `) 389 + if err != nil { 390 + return err 391 + } 392 + 393 + tagsUpdate := fmt.Sprintf(` 394 + UPDATE _user_sim_staging SET 395 + jaccard = jaccard + %g * CAST(_tag_overlap.common AS REAL) / NULLIF( 396 + (SELECT cnt FROM _tag_count WHERE author_did = 
_user_sim_staging.user_a) + 397 + (SELECT cnt FROM _tag_count WHERE author_did = _user_sim_staging.user_b) - 398 + CAST(_tag_overlap.common AS REAL), 399 + 0 400 + ), 401 + common_tags = _tag_overlap.common 402 + FROM _tag_overlap 403 + WHERE _user_sim_staging.user_a = _tag_overlap.user_a 404 + AND _user_sim_staging.user_b = _tag_overlap.user_b 405 + `, e.config.TagsWeight) 406 + 407 + if _, err := tx.ExecContext(ctx, tagsUpdate); err != nil { 408 + return err 409 + } 410 + 411 + followQuery := fmt.Sprintf(` 412 + INSERT INTO _user_sim_staging (user_a, user_b, jaccard, common_feeds, common_likes, common_tags) 413 + SELECT 414 + MIN(f.user_did, f.target_did), 415 + MAX(f.user_did, f.target_did), 416 + %g, 417 + 0, 0, 0 418 + FROM main.follows f 419 + WHERE f.user_did != f.target_did 420 + GROUP BY MIN(f.user_did, f.target_did), MAX(f.user_did, f.target_did) 421 + ON CONFLICT(user_a, user_b) DO UPDATE SET 422 + jaccard = jaccard + %g 423 + `, e.config.FollowBoost, e.config.FollowBoost) 424 + 425 + if _, err := tx.ExecContext(ctx, followQuery); err != nil { 426 + return err 427 + } 428 + 429 + if err := tx.Commit(); err != nil { 430 + return err 431 + } 349 432 } 350 433 351 - followQuery := fmt.Sprintf(` 352 - INSERT INTO recs.user_similarity (user_a, user_b, jaccard, common_feeds, common_likes, common_tags) 353 - SELECT 354 - MIN(f.user_did, f.target_did), 355 - MAX(f.user_did, f.target_did), 356 - %g, 357 - 0, 0, 0 358 - FROM main.follows f 359 - WHERE f.user_did != f.target_did 360 - GROUP BY MIN(f.user_did, f.target_did), MAX(f.user_did, f.target_did) 361 - ON CONFLICT(user_a, user_b) DO UPDATE SET 362 - jaccard = jaccard + %g 363 - `, e.config.FollowBoost, e.config.FollowBoost) 434 + { 435 + tx, err := conn.BeginTx(ctx, nil) 436 + if err != nil { 437 + return err 438 + } 439 + defer func() { _ = tx.Rollback() }() 364 440 365 - if _, err := tx.ExecContext(ctx, followQuery); err != nil { 366 - return err 367 - } 441 + if _, err := tx.ExecContext(ctx, `DELETE FROM 
recs.user_similarity`); err != nil { 442 + return err 443 + } 444 + if _, err := tx.ExecContext(ctx, `INSERT INTO recs.user_similarity (user_a, user_b, jaccard, common_feeds, common_likes, common_tags) SELECT user_a, user_b, jaccard, common_feeds, common_likes, common_tags FROM _user_sim_staging`); err != nil { 445 + return err 446 + } 368 447 369 - e.logger.Info("user similarity computed") 370 - return tx.Commit() 448 + e.logger.Info("user similarity computed") 449 + return tx.Commit() 450 + } 371 451 }
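The `-0.023` constant in the `EXP(...)` expressions above corresponds to roughly a 30-day half-life, since ln(2)/0.023 is about 30.1 days. A minimal sketch of that arithmetic and of the decayed feed-pair score, assuming each shared subscriber contributes one weight based on the older of the two subscriptions (the `MIN(s1.added_at, s2.added_at)` in the query). The function names are illustrative and do not exist in the repository.

```go
package main

import (
	"fmt"
	"math"
)

// decayRate is the per-day rate used in the SQL EXP(-0.023 * age) terms.
const decayRate = 0.023

// decayWeight returns the weight of an event that is ageDays old.
func decayWeight(ageDays float64) float64 {
	return math.Exp(-decayRate * ageDays)
}

// halfLifeDays returns the age at which decayWeight drops to 0.5.
func halfLifeDays() float64 {
	return math.Ln2 / decayRate
}

// decayedJaccard approximates the feed-pair score from the staging query.
// sharedAgesDays holds, per shared subscriber, the age in days of the older
// of that user's two subscriptions. The denominator is the plain union size.
func decayedJaccard(sharedAgesDays []float64, subscribersA, subscribersB int) float64 {
	var weighted float64
	for _, age := range sharedAgesDays {
		weighted += decayWeight(age)
	}
	union := float64(subscribersA + subscribersB - len(sharedAgesDays))
	if union <= 0 {
		return 0
	}
	return weighted / union
}

// approxEqual reports whether got is within tol of want.
func approxEqual(got, want, tol float64) bool {
	return math.Abs(got-want) <= tol
}

func main() {
	fmt.Printf("half-life: %.1f days\n", halfLifeDays())
	fmt.Printf("weight at 30 days: %.3f\n", decayWeight(30))
	// Two shared subscribers, one fresh and one 60 days old.
	fmt.Printf("decayed score: %.3f\n", decayedJaccard([]float64{0, 60}, 3, 4))
}
```

With fresh subscriptions the score equals the ordinary Jaccard index; older overlaps shrink the numerator only, so the result never exceeds the undecayed value.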
+293 -36
internal/cluster/jaccard_test.go
··· 3 3 import ( 4 4 "context" 5 5 "fmt" 6 + "hash/fnv" 7 + "log/slog" 6 8 "os" 9 + "strings" 7 10 "testing" 8 11 12 + vec "github.com/asg017/sqlite-vec-go-bindings/cgo" 9 13 "pkg.rbrt.fr/glean/internal/db" 10 14 11 15 "gotest.tools/v3/assert" 12 - "log/slog" 13 16 ) 14 17 15 18 func setupClusterTestDB(t *testing.T) *db.Databases { 16 19 t.Helper() 20 + vec.Auto() 17 21 f, err := os.CreateTemp("", "glean-cluster-test-*.db") 18 22 assert.NilError(t, err) 19 23 assert.NilError(t, f.Close()) ··· 34 38 dbs, err := db.OpenAll(path) 35 39 assert.NilError(t, err) 36 40 t.Cleanup(func() { _ = dbs.Close() }) 41 + assert.NilError(t, dbs.InitVecTables(8)) 37 42 return dbs 38 43 } 39 44 40 45 func seedClusterData(t *testing.T, ctx context.Context, dbs *db.Databases) { 41 46 t.Helper() 42 47 43 - users := []string{"did:test:alice", "did:test:bob", "did:test:carol"} 48 + users := []string{"did:test:alice", "did:test:bob", "did:test:carol", "did:test:dave"} 44 49 for _, did := range users { 45 50 _, err := dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, did) 46 51 assert.NilError(t, err) ··· 80 85 follows := []struct{ user, target string }{ 81 86 {"did:test:alice", "did:test:bob"}, 82 87 {"did:test:bob", "did:test:carol"}, 88 + {"did:test:carol", "did:test:dave"}, 83 89 } 84 90 for _, f := range follows { 85 91 _, err := dbs.DB().ExecContext(ctx, `INSERT OR IGNORE INTO follows (user_did, target_did) VALUES (?, ?)`, f.user, f.target) ··· 88 94 } 89 95 90 96 func newTestEngine(dbs *db.Databases) *Engine { 91 - return NewEngine(dbs.DB(), slog.Default()) 97 + return NewEngine(dbs.DB(), NewMockEmbedder(8), slog.Default()) 92 98 } 93 99 94 100 func TestComputeFeedSimilarity(t *testing.T) { ··· 219 225 220 226 var count int 221 227 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 222 - `SELECT COUNT(*) FROM recs.recommendation_impressions WHERE user_did = 'did:test:alice'`).Scan(&count)) 228 + `SELECT COUNT(*) FROM main.recommendation_impressions WHERE user_did = 
'did:test:alice'`).Scan(&count)) 223 229 assert.Equal(t, count, 2) 224 230 225 231 assert.NilError(t, engine.RecordImpressions(ctx, "did:test:alice", impressions)) 226 232 227 233 var shownCount int 228 234 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 229 - `SELECT shown_count FROM recs.recommendation_impressions WHERE user_did = 'did:test:alice' AND target_id = 'https://a.com/feed'`).Scan(&shownCount)) 235 + `SELECT shown_count FROM main.recommendation_impressions WHERE user_did = 'did:test:alice' AND target_id = 'https://a.com/feed'`).Scan(&shownCount)) 230 236 assert.Equal(t, shownCount, 2, "shown_count should increment on repeated impression") 231 237 } 232 238 ··· 244 250 245 251 var acted bool 246 252 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 247 - `SELECT acted FROM recs.recommendation_impressions WHERE user_did = 'did:test:alice' AND target_id = 'https://a.com/feed'`).Scan(&acted)) 253 + `SELECT acted FROM main.recommendation_impressions WHERE user_did = 'did:test:alice' AND target_id = 'https://a.com/feed'`).Scan(&acted)) 248 254 assert.Assert(t, acted, "impression should be marked as acted") 249 255 } 250 256 ··· 257 263 engine := newTestEngine(dbs) 258 264 assert.NilError(t, engine.ComputeFollowDistances(ctx)) 259 265 260 - var d1, d2 int 266 + var d1, d2, d3 int 261 267 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 262 268 `SELECT COUNT(*) FROM recs.follow_distances WHERE distance = 1`).Scan(&d1)) 263 269 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 264 270 `SELECT COUNT(*) FROM recs.follow_distances WHERE distance = 2`).Scan(&d2)) 265 - assert.Assert(t, d1 >= 2, "expected at least 2 direct follow distances") 271 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, 272 + `SELECT COUNT(*) FROM recs.follow_distances WHERE distance = 3`).Scan(&d3)) 273 + assert.Assert(t, d1 >= 3, "expected at least 3 direct follow distances") 266 274 assert.Assert(t, d2 >= 1, "expected at least 1 two-hop distance (alice -> bob -> carol)") 275 + 
assert.Assert(t, d3 >= 1, "expected at least 1 three-hop distance (alice -> bob -> carol -> dave)") 267 276 268 - var dist int 277 + var exists int 278 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, 279 + `SELECT COUNT(*) FROM recs.follow_distances WHERE user_a = 'did:test:alice' AND user_b = 'did:test:carol'`).Scan(&exists)) 280 + assert.Assert(t, exists == 1, "alice should reach carol") 281 + 269 282 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 270 - `SELECT distance FROM recs.follow_distances WHERE user_a = 'did:test:alice' AND user_b = 'did:test:carol'`).Scan(&dist)) 271 - assert.Equal(t, dist, 2, "alice should be 2 hops from carol") 283 + `SELECT COUNT(*) FROM recs.follow_distances WHERE user_a = 'did:test:alice' AND user_b = 'did:test:dave'`).Scan(&exists)) 284 + assert.Assert(t, exists == 1, "alice should reach dave via 3 hops") 285 + } 286 + 287 + func TestComputeFollowDistancesData_SplitReadWrite(t *testing.T) { 288 + ctx := context.Background() 289 + dbs := setupClusterTestDB(t) 290 + seedClusterData(t, ctx, dbs) 291 + seedFollowData(t, ctx, dbs) 292 + 293 + engine := newTestEngine(dbs) 294 + 295 + sources := []string{"did:test:alice", "did:test:bob", "did:test:carol", "did:test:dave"} 296 + distances, err := engine.ComputeFollowDistancesData(ctx, sources) 297 + assert.NilError(t, err) 298 + assert.Assert(t, len(distances) > 0, "expected follow distance pairs") 299 + 300 + assert.NilError(t, engine.WriteFollowDistances(ctx, distances)) 301 + 302 + var count int 303 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, `SELECT COUNT(*) FROM recs.follow_distances`).Scan(&count)) 304 + assert.Equal(t, count, len(distances)) 272 305 } 273 306 274 307 func TestAutoDismissStale(t *testing.T) { ··· 279 312 engine := newTestEngine(dbs) 280 313 281 314 _, err := dbs.DB().ExecContext(ctx, ` 282 - INSERT INTO recs.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 283 - VALUES ('did:test:alice', 
'feed', 'https://stale.com/feed', datetime('now', '-31 days'), datetime('now'), 20, 0) 315 + INSERT INTO main.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 316 + VALUES ('did:test:alice', 'feed', 'https://stale.com/feed', datetime('now', '-6 days'), datetime('now'), 6, 0) 284 317 `) 285 318 assert.NilError(t, err) 286 319 287 - assert.NilError(t, engine.AutoDismissStale(ctx, 15, 30)) 320 + assert.NilError(t, engine.AutoDismissStale(ctx, 5, 5)) 288 321 289 322 dismissed, err := engine.IsFeedDismissed(ctx, "did:test:alice", "https://stale.com/feed") 290 323 assert.NilError(t, err) ··· 299 332 engine := newTestEngine(dbs) 300 333 301 334 _, err := dbs.DB().ExecContext(ctx, ` 302 - INSERT INTO recs.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 335 + INSERT INTO main.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 303 336 VALUES ('did:test:alice', 'feed', 'https://recent.com/feed', datetime('now'), datetime('now'), 5, 0) 304 337 `) 305 338 assert.NilError(t, err) 306 339 307 - assert.NilError(t, engine.AutoDismissStale(ctx, 15, 30)) 340 + assert.NilError(t, engine.AutoDismissStale(ctx, 5, 5)) 308 341 309 342 dismissed, err := engine.IsFeedDismissed(ctx, "did:test:alice", "https://recent.com/feed") 310 343 assert.NilError(t, err) ··· 319 352 engine := newTestEngine(dbs) 320 353 321 354 _, err := dbs.DB().ExecContext(ctx, ` 322 - INSERT INTO recs.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 323 - VALUES ('did:test:alice', 'feed', 'https://acted.com/feed', datetime('now', '-31 days'), datetime('now'), 20, 1) 355 + INSERT INTO main.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 356 + VALUES ('did:test:alice', 'feed', 'https://acted.com/feed', 
datetime('now', '-6 days'), datetime('now'), 6, 1) 324 357 `) 325 358 assert.NilError(t, err) 326 359 327 - assert.NilError(t, engine.AutoDismissStale(ctx, 15, 30)) 360 + assert.NilError(t, engine.AutoDismissStale(ctx, 5, 5)) 328 361 329 362 dismissed, err := engine.IsFeedDismissed(ctx, "did:test:alice", "https://acted.com/feed") 330 363 assert.NilError(t, err) ··· 384 417 assert.Equal(t, w.WSocial, 0.7) 385 418 assert.Equal(t, w.WPop, 0.2) 386 419 assert.Equal(t, w.WCategory, 0.4) 420 + assert.Equal(t, w.WContent, 0.4) 387 421 } 388 422 389 423 func TestSignalWeights_RewardPenalize(t *testing.T) { ··· 394 428 engine := newTestEngine(dbs) 395 429 396 430 _, err := dbs.DB().ExecContext(ctx, ` 397 - INSERT INTO recs.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 431 + INSERT INTO main.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 398 432 VALUES ('did:test:alice', 'feed', 'https://a.com/feed', datetime('now'), datetime('now'), 1, 1) 399 433 `) 400 434 assert.NilError(t, err) 401 435 for i := range minActionsTune { 402 436 _, err = dbs.DB().ExecContext(ctx, ` 403 - INSERT INTO recs.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 437 + INSERT INTO main.recommendation_impressions (user_did, target_type, target_id, first_shown_at, last_shown_at, shown_count, acted) 404 438 VALUES ('did:test:alice', 'feed', ?, datetime('now'), datetime('now'), 1, 1) 405 439 `, fmt.Sprintf("https://%d.com/feed", i)) 406 440 assert.NilError(t, err) ··· 472 506 473 507 var count int 474 508 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 475 - `SELECT COUNT(*) FROM recs.dismissed_recommendations WHERE user_did = 'did:test:alice' AND target_type = 'article'`).Scan(&count)) 509 + `SELECT COUNT(*) FROM main.dismissed_recommendations WHERE user_did = 'did:test:alice' AND target_type = 
'article'`).Scan(&count)) 476 510 assert.Equal(t, count, 1) 477 511 } 478 512 ··· 501 535 502 536 var count int 503 537 assert.NilError(t, dbs.DB().QueryRowContext(ctx, 504 - `SELECT COUNT(*) FROM recs.dismissed_recommendations WHERE user_did = 'did:test:alice' AND target_type = 'feed'`).Scan(&count)) 538 + `SELECT COUNT(*) FROM main.dismissed_recommendations WHERE user_did = 'did:test:alice' AND target_type = 'feed'`).Scan(&count)) 505 539 assert.Equal(t, count, 1, "duplicate dismiss should not create extra rows") 506 540 } 507 541 508 - func TestDescriptionBasedFeedSimilarity(t *testing.T) { 542 + func TestEmbeddingBasedFeedSimilarity(t *testing.T) { 509 543 ctx := context.Background() 510 544 dbs := setupClusterTestDB(t) 511 545 ··· 514 548 _, err = dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, "did:test:bob") 515 549 assert.NilError(t, err) 516 550 517 - _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type) VALUES (?, ?, ?, ?, 'rss')`, 551 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type, subscriber_count) VALUES (?, ?, ?, ?, 'rss', 2)`, 518 552 "https://go.com/feed", "Go Blog", "https://go.com", "programming language golang software development") 519 553 assert.NilError(t, err) 520 - _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type) VALUES (?, ?, ?, ?, 'rss')`, 554 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type, subscriber_count) VALUES (?, ?, ?, ?, 'rss', 2)`, 521 555 "https://rust.com/feed", "Rust Blog", "https://rust.com", "programming language rust software development") 522 556 assert.NilError(t, err) 523 557 524 - _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, "did:test:alice", "https://go.com/feed") 558 + _, err = 
dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, CURRENT_TIMESTAMP)`, "did:test:alice", "https://go.com/feed") 559 + assert.NilError(t, err) 560 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, CURRENT_TIMESTAMP)`, "did:test:alice", "https://rust.com/feed") 561 + assert.NilError(t, err) 562 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, CURRENT_TIMESTAMP)`, "did:test:bob", "https://go.com/feed") 563 + assert.NilError(t, err) 564 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, CURRENT_TIMESTAMP)`, "did:test:bob", "https://rust.com/feed") 565 + assert.NilError(t, err) 566 + 567 + engine := newTestEngine(dbs) 568 + 569 + assert.NilError(t, engine.ComputeFeedEmbeddings(ctx)) 570 + assert.NilError(t, engine.ComputeFeedSimilarity(ctx)) 571 + 572 + var jaccard float64 573 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, 574 + `SELECT jaccard FROM recs.feed_similarity WHERE feed_a = ? 
AND feed_b = ?`, 575 + "https://go.com/feed", "https://rust.com/feed").Scan(&jaccard)) 576 + assert.Assert(t, jaccard > 1.0, "embedding cosine similarity should boost feed similarity above pure Jaccard") 577 + } 578 + 579 + type MockEmbedder struct { 580 + dimension int 581 + } 582 + 583 + func NewMockEmbedder(dimension int) *MockEmbedder { 584 + return &MockEmbedder{dimension: dimension} 585 + } 586 + 587 + func (m *MockEmbedder) Embed(_ context.Context, texts []string) ([][]float32, error) { 588 + result := make([][]float32, len(texts)) 589 + for i, text := range texts { 590 + vec := make([]float32, m.dimension) 591 + for word := range strings.FieldsSeq(strings.ToLower(text)) { 592 + if len(word) < 2 { 593 + continue 594 + } 595 + h := fnv.New32a() 596 + h.Write([]byte(word)) 597 + idx := h.Sum32() % uint32(m.dimension) 598 + vec[idx] += 1.0 599 + } 600 + result[i] = vec 601 + } 602 + return result, nil 603 + } 604 + 605 + func (m *MockEmbedder) Dimension() int { 606 + return m.dimension 607 + } 608 + 609 + func TestComputeArticleEmbeddings(t *testing.T) { 610 + ctx := context.Background() 611 + dbs := setupClusterTestDB(t) 612 + 613 + _, err := dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, feed_type, subscriber_count) VALUES (?, ?, ?, 'rss', 1)`, 614 + "https://tech.com/feed", "Tech Feed", "https://tech.com") 615 + assert.NilError(t, err) 616 + 617 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.articles (feed_url, guid, title, summary, url) VALUES (?, ?, ?, ?, ?)`, 618 + "https://tech.com/feed", "1", "golang programming language tutorial", "learn the go programming language for backend development", "https://tech.com/go") 619 + assert.NilError(t, err) 620 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.articles (feed_url, guid, title, summary, url) VALUES (?, ?, ?, ?, ?)`, 621 + "https://tech.com/feed", "2", "rust programming language guide", "learn the rust programming language for systems development", 
"https://tech.com/rust") 622 + assert.NilError(t, err) 623 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.articles (feed_url, guid, title, summary, url) VALUES (?, ?, ?, ?, ?)`, 624 + "https://tech.com/feed", "3", "cooking recipes for dinner", "easy dinner recipes for the whole family", "https://tech.com/cook") 625 + assert.NilError(t, err) 626 + 627 + engine := newTestEngine(dbs) 628 + 629 + assert.NilError(t, engine.ComputeArticleEmbeddings(ctx)) 630 + 631 + var count int 632 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, `SELECT COUNT(*) FROM recs.article_embeddings`).Scan(&count)) 633 + assert.Equal(t, count, 3, "expected 3 article embeddings") 634 + } 635 + 636 + func TestArticleRecommendationsWithContentBoost(t *testing.T) { 637 + ctx := context.Background() 638 + dbs := setupClusterTestDB(t) 639 + 640 + _, err := dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, "did:test:alice") 641 + assert.NilError(t, err) 642 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, "did:test:bob") 643 + assert.NilError(t, err) 644 + 645 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, feed_type, subscriber_count) VALUES (?, ?, ?, 'rss', 2)`, 646 + "https://tech.com/feed", "Tech Feed", "https://tech.com") 647 + assert.NilError(t, err) 648 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, feed_type, subscriber_count) VALUES (?, ?, ?, 'rss', 2)`, 649 + "https://dev.com/feed", "Dev Feed", "https://dev.com") 650 + assert.NilError(t, err) 651 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, feed_type, subscriber_count) VALUES (?, ?, ?, 'rss', 2)`, 652 + "https://shared.com/feed", "Shared Feed", "https://shared.com") 653 + assert.NilError(t, err) 654 + 655 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, "did:test:alice", 
"https://tech.com/feed") 656 + assert.NilError(t, err) 657 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, "did:test:alice", "https://shared.com/feed") 658 + assert.NilError(t, err) 659 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, "did:test:bob", "https://dev.com/feed") 525 660 assert.NilError(t, err) 526 - _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, "did:test:bob", "https://rust.com/feed") 661 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, "did:test:bob", "https://shared.com/feed") 662 + assert.NilError(t, err) 663 + 664 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.articles (feed_url, guid, title, summary, url, published) VALUES (?, ?, ?, ?, ?, datetime('now'))`, 665 + "https://tech.com/feed", "1", "golang programming tutorial", "learn go programming", "https://tech.com/go") 666 + assert.NilError(t, err) 667 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.articles (feed_url, guid, title, summary, url, published) VALUES (?, ?, ?, ?, ?, datetime('now'))`, 668 + "https://dev.com/feed", "2", "rust programming tutorial", "learn rust programming", "https://dev.com/rust") 669 + assert.NilError(t, err) 670 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.articles (feed_url, guid, title, summary, url, published) VALUES (?, ?, ?, ?, ?, datetime('now'))`, 671 + "https://dev.com/feed", "3", "cooking dinner recipes", "easy dinner recipes", "https://dev.com/cook") 672 + assert.NilError(t, err) 673 + 674 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.likes (uri, author_did, feed_url, article_url, created_at) VALUES (?, ?, ?, ?, datetime('now'))`, 675 + "at://alice/like/1", "did:test:alice", "https://tech.com/feed", "https://tech.com/go") 676 + assert.NilError(t, err) 677 + _, err = 
dbs.DB().ExecContext(ctx, `INSERT INTO articles.likes (uri, author_did, feed_url, article_url, created_at) VALUES (?, ?, ?, ?, datetime('now'))`, 678 + "at://bob/like/1", "did:test:bob", "https://dev.com/feed", "https://dev.com/rust") 527 679 assert.NilError(t, err) 528 680 529 681 engine := newTestEngine(dbs) 682 + 530 683 assert.NilError(t, engine.ComputeFeedSimilarity(ctx)) 684 + assert.NilError(t, engine.ComputeUserSimilarity(ctx)) 685 + assert.NilError(t, engine.ComputeArticleEmbeddings(ctx)) 686 + 687 + recs, err := engine.GetArticleRecommendations(ctx, "did:test:alice", 10) 688 + assert.NilError(t, err) 689 + assert.Assert(t, len(recs) > 0, "alice should get article recommendations") 690 + } 691 + 692 + func TestFeedEmbeddingRecomputedOnDescriptionChange(t *testing.T) { 693 + ctx := context.Background() 694 + dbs := setupClusterTestDB(t) 695 + 696 + _, err := dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type) VALUES (?, ?, ?, ?, 'rss')`, 697 + "https://go.com/feed", "Go Blog", "https://go.com", "old description") 698 + assert.NilError(t, err) 699 + 700 + engine := newTestEngine(dbs) 701 + 702 + assert.NilError(t, engine.ComputeFeedEmbeddings(ctx)) 531 703 532 704 var count int 533 - assert.NilError(t, dbs.DB().QueryRowContext(ctx, `SELECT COUNT(*) FROM recs.feed_similarity`).Scan(&count)) 534 - assert.Assert(t, count >= 0, "description-based similarity should produce pairs") 705 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, `SELECT COUNT(*) FROM recs.feed_embeddings`).Scan(&count)) 706 + assert.Equal(t, count, 1) 535 707 536 - if count > 0 { 537 - var jaccard float64 538 - assert.NilError(t, dbs.DB().QueryRowContext(ctx, 539 - `SELECT jaccard FROM recs.feed_similarity WHERE feed_a = ? 
AND feed_b = ?`, 540 - "https://go.com/feed", "https://rust.com/feed").Scan(&jaccard)) 541 - assert.Assert(t, jaccard > 0, "description word overlap should boost similarity") 708 + _, err = dbs.DB().ExecContext(ctx, `UPDATE articles.feeds SET description = 'new description' WHERE feed_url = 'https://go.com/feed'`) 709 + assert.NilError(t, err) 710 + 711 + assert.NilError(t, engine.ComputeFeedEmbeddings(ctx)) 712 + 713 + var sourceText string 714 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, `SELECT source_text FROM recs.feed_embedding_meta WHERE feed_url = 'https://go.com/feed'`).Scan(&sourceText)) 715 + assert.Assert(t, sourceText == "Go Blog new description", "embedding should be recomputed when description changes, got: %s", sourceText) 716 + } 717 + 718 + func TestTimeDecayedFeedSimilarity(t *testing.T) { 719 + ctx := context.Background() 720 + dbs := setupClusterTestDB(t) 721 + 722 + _, err := dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, "did:test:alice") 723 + assert.NilError(t, err) 724 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, "did:test:bob") 725 + assert.NilError(t, err) 726 + 727 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, feed_type, subscriber_count) VALUES (?, ?, ?, 'rss', 2)`, 728 + "https://a.com/feed", "Feed A", "https://a.com") 729 + assert.NilError(t, err) 730 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, feed_type, subscriber_count) VALUES (?, ?, ?, 'rss', 2)`, 731 + "https://b.com/feed", "Feed B", "https://b.com") 732 + assert.NilError(t, err) 733 + 734 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, datetime('now', '-60 days'))`, 735 + "did:test:alice", "https://a.com/feed") 736 + assert.NilError(t, err) 737 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, 
datetime('now'))`, 738 + "did:test:bob", "https://a.com/feed") 739 + assert.NilError(t, err) 740 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, datetime('now'))`, 741 + "did:test:alice", "https://b.com/feed") 742 + assert.NilError(t, err) 743 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url, added_at) VALUES (?, ?, datetime('now'))`, 744 + "did:test:bob", "https://b.com/feed") 745 + assert.NilError(t, err) 746 + 747 + engine := newTestEngine(dbs) 748 + assert.NilError(t, engine.ComputeFeedSimilarity(ctx)) 749 + 750 + var jaccard float64 751 + assert.NilError(t, dbs.DB().QueryRowContext(ctx, 752 + `SELECT jaccard FROM recs.feed_similarity WHERE feed_a = 'https://a.com/feed' AND feed_b = 'https://b.com/feed'`).Scan(&jaccard)) 753 + assert.Assert(t, jaccard > 0, "time-decayed feed similarity should be positive") 754 + assert.Assert(t, jaccard < 1.0, "time decay should reduce similarity below raw Jaccard") 755 + } 756 + 757 + func TestColdStartFromEmbeddings(t *testing.T) { 758 + ctx := context.Background() 759 + dbs := setupClusterTestDB(t) 760 + 761 + _, err := dbs.DB().ExecContext(ctx, `INSERT INTO users (did) VALUES (?)`, "did:test:newuser") 762 + assert.NilError(t, err) 763 + 764 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type, subscriber_count) VALUES (?, ?, ?, ?, 'rss', 2)`, 765 + "https://go.com/feed", "Go Blog", "https://go.com", "golang programming language") 766 + assert.NilError(t, err) 767 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, description, feed_type, subscriber_count) VALUES (?, ?, ?, ?, 'rss', 2)`, 768 + "https://godev.com/feed", "Go Dev", "https://godev.com", "golang development tutorials") 769 + assert.NilError(t, err) 770 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.feeds (feed_url, title, site_url, 
description, feed_type, subscriber_count) VALUES (?, ?, ?, ?, 'rss', 2)`, 771 + "https://cooking.com/feed", "Cooking", "https://cooking.com", "recipes for dinner") 772 + assert.NilError(t, err) 773 + 774 + _, err = dbs.DB().ExecContext(ctx, `INSERT INTO articles.subscriptions (user_did, feed_url) VALUES (?, ?)`, 775 + "did:test:newuser", "https://go.com/feed") 776 + assert.NilError(t, err) 777 + 778 + engine := newTestEngine(dbs) 779 + assert.NilError(t, engine.ComputeFeedEmbeddings(ctx)) 780 + 781 + recs, err := engine.ColdStartRecommendations(ctx, "did:test:newuser", 10) 782 + assert.NilError(t, err) 783 + assert.Assert(t, len(recs) > 0, "embedding cold start should return recommendations") 784 + 785 + for _, r := range recs { 786 + assert.Assert(t, r.FeedURL != "https://go.com/feed", "should not recommend already subscribed feed") 542 787 } 543 788 } 789 + 790 + func TestNormalizeFeedScores(t *testing.T) { 791 + recs := []*FeedRecommendation{ 792 + {FeedURL: "a", Score: 10.0}, 793 + {FeedURL: "b", Score: 5.0}, 794 + {FeedURL: "c", Score: 1.0}, 795 + } 796 + normalizeFeedScores(recs) 797 + assert.Equal(t, recs[0].Score, 1.0) 798 + assert.Equal(t, recs[1].Score, 4.0/9.0) 799 + assert.Equal(t, recs[2].Score, 0.0) 800 + }
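Note on TestNormalizeFeedScores: the expected values 1.0, 4.0/9.0 and 0.0 follow from min-max normalization, (score - min) / (max - min), applied to the scores 10, 5 and 1. The sketch below reproduces that arithmetic with a single generic helper. It is an illustration only: `minMaxNormalize` and `feedRec` are hypothetical names that do not appear in this change, which instead defines three separate functions (`normalizeFeedScores`, `normalizeArticleScores`, `normalizePersonScores`) with the same body.

```go
package main

import "fmt"

// feedRec is a stand-in for the FeedRecommendation type in this change;
// only the fields needed for the illustration are included (assumption).
type feedRec struct {
	FeedURL string
	Score   float64
}

// minMaxNormalize rescales the value exposed through get/set into [0, 1].
// Hypothetical helper, not part of the change under review. Like the
// functions in the diff, it leaves the input untouched when there are
// fewer than two items or when all values are equal.
func minMaxNormalize[T any](items []T, get func(T) float64, set func(T, float64)) {
	if len(items) < 2 {
		return
	}
	lo, hi := get(items[0]), get(items[0])
	for _, it := range items[1:] {
		v := get(it)
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if hi == lo {
		return
	}
	span := hi - lo
	for _, it := range items {
		set(it, (get(it)-lo)/span)
	}
}

func main() {
	recs := []*feedRec{
		{FeedURL: "a", Score: 10.0},
		{FeedURL: "b", Score: 5.0},
		{FeedURL: "c", Score: 1.0},
	}
	minMaxNormalize(recs,
		func(r *feedRec) float64 { return r.Score },
		func(r *feedRec, v float64) { r.Score = v })
	for _, r := range recs {
		// (10-1)/9 = 1, (5-1)/9 = 4/9, (1-1)/9 = 0
		fmt.Printf("%s %.4f\n", r.FeedURL, r.Score)
	}
}
```

Whether folding the three functions into one is worth the extra accessor closures is a judgment call; the person variant normalizes the `Jaccard` field rather than `Score`, which is why the sketch takes getter and setter functions instead of assuming a field name.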
+351 -90
internal/cluster/scoring.go
··· 3 3 import ( 4 4 "context" 5 5 "database/sql" 6 + "fmt" 7 + 8 + vec "github.com/asg017/sqlite-vec-go-bindings/cgo" 6 9 ) 7 10 8 11 type FeedRecommendation struct { ··· 40 43 Score float64 41 44 } 42 45 46 + // GetFeedRecommendations returns feed recommendations for a user. Users with 47 + // fewer than 5 subscriptions get cold-start recommendations (embedding-based 48 + // KNN or graph+popular fallback). Results are min-max normalized and 49 + // diversity-filtered before returning. 43 50 func (e *Engine) GetFeedRecommendations(ctx context.Context, userDID string, limit int) ([]*FeedRecommendation, error) { 44 51 subCount := 0 45 52 _ = e.db.QueryRowContext(ctx, `SELECT COUNT(*) FROM articles.subscriptions WHERE user_did = ?`, userDID).Scan(&subCount) ··· 47 54 if subCount < 5 { 48 55 recs, err := e.ColdStartRecommendations(ctx, userDID, limit*2) 49 56 if err == nil && len(recs) > 0 { 57 + normalizeFeedScores(recs) 50 58 return ApplyDiversity(recs, limit), nil 51 59 } 52 60 } ··· 56 64 return nil, err 57 65 } 58 66 67 + normalizeFeedScores(recs) 59 68 return ApplyDiversity(recs, limit), nil 60 69 } 61 70 71 + // GetPeopleRecommendations returns similar users based on subscription overlap, 72 + // like co-occurrence, tag overlap, and follow relationships. Scores are min-max 73 + // normalized on the Jaccard field. 62 74 func (e *Engine) GetPeopleRecommendations(ctx context.Context, userDID string, limit int) ([]*PersonRecommendation, error) { 63 - return e.ComputePeopleRecommendationsOnDemand(ctx, userDID, limit) 75 + recs, err := e.ComputePeopleRecommendationsOnDemand(ctx, userDID, limit) 76 + if err != nil { 77 + return nil, err 78 + } 79 + normalizePersonScores(recs) 80 + return recs, nil 64 81 } 65 82 83 + // GetArticleRecommendations returns article recommendations combining social 84 + // signals (liked by similar users, followed users' feeds), content similarity 85 + // (embedding KNN against user's liked articles), and recency. 
Scores are 86 + // min-max normalized. 66 87 func (e *Engine) GetArticleRecommendations(ctx context.Context, userDID string, limit int) ([]*ArticleRecommendation, error) { 67 - return e.ComputeArticleRecommendationsOnDemand(ctx, userDID, limit) 88 + recs, err := e.ComputeArticleRecommendationsOnDemand(ctx, userDID, limit) 89 + if err != nil { 90 + return nil, err 91 + } 92 + normalizeArticleScores(recs) 93 + return recs, nil 68 94 } 69 95 96 + // SignalWeights holds per-signal multipliers used in the recommendation scoring 97 + // formula. Weights are auto-tuned per user via a bandit-style reward/penalty 98 + // system (see weights.go). Each field maps to a column in 99 + // recs.user_signal_weights. 70 100 type SignalWeights struct { 71 101 WSub float64 72 102 WLike float64 ··· 74 104 WSocial float64 75 105 WPop float64 76 106 WCategory float64 107 + WContent float64 77 108 } 78 109 79 110 func defaultWeights() SignalWeights { ··· 84 115 WSocial: 0.7, 85 116 WPop: 0.2, 86 117 WCategory: 0.4, 118 + WContent: 0.4, 87 119 } 88 120 } 89 121 ··· 91 123 w := defaultWeights() 92 124 var dbW SignalWeights 93 125 err := e.db.QueryRowContext(ctx, ` 94 - SELECT w_sub, w_like, w_tag, w_social, w_pop, w_category 126 + SELECT w_sub, w_like, w_tag, w_social, w_pop, w_category, w_content 95 127 FROM recs.user_signal_weights WHERE user_did = ? 96 - `, userDID).Scan(&dbW.WSub, &dbW.WLike, &dbW.WTag, &dbW.WSocial, &dbW.WPop, &dbW.WCategory) 128 + `, userDID).Scan(&dbW.WSub, &dbW.WLike, &dbW.WTag, &dbW.WSocial, &dbW.WPop, &dbW.WCategory, &dbW.WContent) 97 129 if err == nil { 98 130 return dbW 99 131 } ··· 115 147 FROM similar_users su 116 148 JOIN articles.subscriptions s ON s.user_did = su.peer 117 149 WHERE s.feed_url NOT IN (SELECT feed_url FROM articles.subscriptions WHERE user_did = ?) 118 - AND s.feed_url NOT IN (SELECT target_id FROM recs.dismissed_recommendations WHERE user_did = ? 
AND target_type = 'feed') 150 + AND s.feed_url NOT IN (SELECT target_id FROM main.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed') 119 151 GROUP BY s.feed_url 120 152 ), 121 153 like_signals AS ( ··· 125 157 JOIN articles.likes l ON l.author_did = su.peer 126 158 JOIN articles.subscriptions s ON s.feed_url = l.feed_url 127 159 WHERE s.feed_url NOT IN (SELECT feed_url FROM articles.subscriptions WHERE user_did = ?) 128 - AND s.feed_url NOT IN (SELECT target_id FROM recs.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed') 160 + AND s.feed_url NOT IN (SELECT target_id FROM main.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed') 129 161 GROUP BY s.feed_url 130 162 ), 131 163 social_boost AS ( 132 164 SELECT s.feed_url, 133 - SUM(CASE WHEN fd.distance = 1 THEN 1.0 ELSE 0.3 END) AS social 165 + SUM(CASE fd.distance WHEN 1 THEN 1.0 WHEN 2 THEN 0.3 WHEN 3 THEN 0.1 ELSE 0 END) AS social 134 166 FROM recs.follow_distances fd 135 167 JOIN articles.subscriptions s ON s.user_did = fd.user_b 136 168 WHERE fd.user_a = ? 137 169 AND s.feed_url NOT IN (SELECT feed_url FROM articles.subscriptions WHERE user_did = ?) 138 - AND s.feed_url NOT IN (SELECT target_id FROM recs.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed') 170 + AND s.feed_url NOT IN (SELECT target_id FROM main.dismissed_recommendations WHERE user_did = ? 
AND target_type = 'feed')
             GROUP BY s.feed_url
         ),
         category_counts AS (
···
     return results, rows.Err()
 }

+func normalizeFeedScores(recs []*FeedRecommendation) {
+    if len(recs) < 2 {
+        return
+    }
+    min, max := recs[0].Score, recs[0].Score
+    for _, r := range recs[1:] {
+        if r.Score < min {
+            min = r.Score
+        }
+        if r.Score > max {
+            max = r.Score
+        }
+    }
+    if max == min {
+        return
+    }
+    span := max - min
+    for _, r := range recs {
+        r.Score = (r.Score - min) / span
+    }
+}
+
+func normalizeArticleScores(recs []*ArticleRecommendation) {
+    if len(recs) < 2 {
+        return
+    }
+    min, max := recs[0].Score, recs[0].Score
+    for _, r := range recs[1:] {
+        if r.Score < min {
+            min = r.Score
+        }
+        if r.Score > max {
+            max = r.Score
+        }
+    }
+    if max == min {
+        return
+    }
+    span := max - min
+    for _, r := range recs {
+        r.Score = (r.Score - min) / span
+    }
+}
+
+func normalizePersonScores(recs []*PersonRecommendation) {
+    if len(recs) < 2 {
+        return
+    }
+    min, max := recs[0].Jaccard, recs[0].Jaccard
+    for _, r := range recs[1:] {
+        if r.Jaccard < min {
+            min = r.Jaccard
+        }
+        if r.Jaccard > max {
+            max = r.Jaccard
+        }
+    }
+    if max == min {
+        return
+    }
+    span := max - min
+    for _, r := range recs {
+        r.Jaccard = (r.Jaccard - min) / span
+    }
+}
+
+func (e *Engine) coldStartFromEmbeddings(ctx context.Context, userDID string, limit int) ([]*FeedRecommendation, error) {
+    if e.embedder == nil {
+        return nil, nil
+    }
+
+    conn, err := e.db.Conn(ctx)
+    if err != nil {
+        return nil, err
+    }
+    defer conn.Close()
+
+    subRows, err := conn.QueryContext(ctx, `
+        SELECT fe.feed_url, fe.embedding FROM articles.subscriptions s
+        JOIN recs.feed_embeddings fe ON fe.feed_url = s.feed_url
+        WHERE s.user_did = ?
+    `, userDID)
+    if err != nil {
+        return nil, err
+    }
+
+    dim := e.embedder.Dimension()
+    sumVec := make([]float32, dim)
+    subCount := 0
+    var subFeedURLs []string
+    for subRows.Next() {
+        var url string
+        var blob []byte
+        if err := subRows.Scan(&url, &blob); err != nil {
+            subRows.Close()
+            return nil, err
+        }
+        v := deserializeFloat32(blob)
+        if len(v) != dim {
+            continue
+        }
+        for j := range sumVec {
+            sumVec[j] += v[j]
+        }
+        subCount++
+        subFeedURLs = append(subFeedURLs, url)
+    }
+    subRows.Close()
+
+    if subCount == 0 {
+        return nil, nil
+    }
+
+    avgVec := make([]float32, dim)
+    for j := range avgVec {
+        avgVec[j] = sumVec[j] / float32(subCount)
+    }
+
+    subSet := make(map[string]bool, len(subFeedURLs))
+    for _, u := range subFeedURLs {
+        subSet[u] = true
+    }
+
+    queryBlob, err := vec.SerializeFloat32(avgVec)
+    if err != nil {
+        return nil, fmt.Errorf("serialize query vector: %w", err)
+    }
+
+    knnRows, err := conn.QueryContext(ctx, `
+        SELECT fe.feed_url, fe.distance, COALESCE(f.title, ''), COALESCE(f.site_url, ''),
+            COALESCE(f.description, ''), f.subscriber_count, COALESCE(f.favicon_url, '')
+        FROM recs.feed_embeddings fe
+        JOIN articles.feeds f ON f.feed_url = fe.feed_url
+        WHERE fe.embedding MATCH ? AND fe.k = ?
+        ORDER BY fe.distance
+    `, queryBlob, limit+len(subSet))
+    if err != nil {
+        return nil, err
+    }
+
+    var results []*FeedRecommendation
+    for knnRows.Next() {
+        var r FeedRecommendation
+        var dist float64
+        if err := knnRows.Scan(&r.FeedURL, &dist, &r.Title, &r.SiteURL,
+            &r.Description, &r.SubscriberCount, &r.FaviconURL); err != nil {
+            knnRows.Close()
+            return nil, err
+        }
+        if subSet[r.FeedURL] {
+            continue
+        }
+        r.Score = 1.0 - dist
+        if r.Score <= 0 {
+            continue
+        }
+        results = append(results, &r)
+        if len(results) >= limit {
+            break
+        }
+    }
+    knnRows.Close()
+
+    return results, nil
+}
+
 func (e *Engine) ComputeArticleRecommendationsOnDemand(ctx context.Context, userDID string, limit int) ([]*ArticleRecommendation, error) {
     w := e.GetWeights(ctx, userDID)

-    rows, err := e.db.QueryContext(ctx, `
+    conn, err := e.db.Conn(ctx)
+    if err != nil {
+        return nil, err
+    }
+    defer conn.Close()
+
+    if err := e.ensureContentBoostTable(ctx, conn); err != nil {
+        return nil, err
+    }
+
+    if e.embedder != nil {
+        if err := e.populateContentBoost(ctx, conn, userDID); err != nil {
+            e.logger.Warn("content boost failed", "error", err)
+        }
+    }
+
+    rows, err := conn.QueryContext(ctx, `
         WITH similar_users AS (
             SELECT user_b AS peer, jaccard FROM recs.user_similarity WHERE user_a = ? AND jaccard > 0.15
             UNION ALL
···
                 SELECT 1 FROM articles.likes ul WHERE ul.author_did = ? AND ul.feed_url = l.feed_url AND ul.article_url = l.article_url
             )
             AND NOT EXISTS (
-                SELECT 1 FROM recs.dismissed_recommendations d WHERE d.user_did = ? AND d.target_type = 'article' AND d.target_id = l.article_url
+                SELECT 1 FROM main.dismissed_recommendations d WHERE d.user_did = ? AND d.target_type = 'article' AND d.target_id = l.article_url
             )
             GROUP BY l.feed_url, l.article_url
         ),
         social_likes AS (
             SELECT l.feed_url, l.article_url,
-                SUM(CASE WHEN fd.distance = 1 THEN 1.0 ELSE 0.3 END) AS social
+                SUM(CASE fd.distance WHEN 1 THEN 1.0 WHEN 2 THEN 0.3 WHEN 3 THEN 0.1 ELSE 0 END) AS social
             FROM recs.follow_distances fd
             JOIN articles.likes l ON l.author_did = fd.user_b
             WHERE fd.user_a = ?
···
             COALESCE(rs.is_read, 0),
             COALESCE(la.like_signal, 0) * ?
             + COALESCE(sl.social, 0) * ?
+            + COALESCE(cb.score, 0) * ?
             + EXP(-0.023 * CAST(julianday('now') - julianday(a.published) AS REAL)) * 0.2
             AS score
         FROM liked_articles la
         JOIN articles.articles a ON a.feed_url = la.feed_url AND a.url = la.article_url
         LEFT JOIN articles.feeds f ON f.feed_url = la.feed_url
         LEFT JOIN social_likes sl ON sl.feed_url = la.feed_url AND sl.article_url = la.article_url
+        LEFT JOIN _content_boost cb ON cb.article_id = a.id
         LEFT JOIN articles.read_state rs ON rs.article_id = a.id AND rs.user_did = ?
         WHERE COALESCE(rs.is_read, 0) = 0
         ORDER BY score DESC, (CASE WHEN a.published > 'now' THEN 1 ELSE 0 END), a.published DESC
         LIMIT ?
-    `, userDID, userDID, userDID, userDID, userDID, userDID, w.WLike, w.WSocial, userDID, limit)
+    `, userDID, userDID, userDID, userDID, userDID, userDID,
+        w.WLike, w.WSocial, w.WContent, userDID, limit)
     if err != nil {
         return nil, err
     }
···
 }

 func (e *Engine) ComputeSignalProfiles(ctx context.Context) error {
-    tx, err := e.db.BeginTx(ctx, nil)
+    conn, err := e.db.Conn(ctx)
     if err != nil {
         return err
     }
-    defer func() { _ = tx.Rollback() }()
+    defer conn.Close()

-    if _, err := tx.ExecContext(ctx, `DELETE FROM recs.user_signal_profiles`); err != nil {
-        return err
-    }
+    {
+        tx, err := conn.BeginTx(ctx, nil)
+        if err != nil {
+            return err
+        }
+        defer func() { _ = tx.Rollback() }()

-    if _, err := tx.ExecContext(ctx, `
-        CREATE TEMP TABLE IF NOT EXISTS _user_like_counts (user_did TEXT PRIMARY KEY, cnt INT)
-    `); err != nil {
-        return err
-    }
-    if _, err := tx.ExecContext(ctx, `DELETE FROM _user_like_counts`); err != nil {
-        return err
-    }
-    if _, err := tx.ExecContext(ctx, `
-        INSERT INTO _user_like_counts SELECT author_did, COUNT(*) FROM articles.likes GROUP BY author_did
-    `); err != nil {
-        return err
-    }
+        if _, err := tx.ExecContext(ctx, `
+            CREATE TEMP TABLE IF NOT EXISTS _user_like_counts (user_did TEXT PRIMARY KEY, cnt INT)
+        `); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `DELETE FROM _user_like_counts`); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `
+            INSERT INTO _user_like_counts SELECT author_did, COUNT(*) FROM articles.likes GROUP BY author_did
+        `); err != nil {
+            return err
+        }

-    if _, err := tx.ExecContext(ctx, `
-        CREATE TEMP TABLE IF NOT EXISTS _user_tag_counts (user_did TEXT PRIMARY KEY, cnt INT)
-    `); err != nil {
-        return err
-    }
-    if _, err := tx.ExecContext(ctx, `DELETE FROM _user_tag_counts`); err != nil {
-        return err
-    }
-    if _, err := tx.ExecContext(ctx, `
-        INSERT INTO _user_tag_counts
-        WITH user_tags AS (
-            SELECT author_did, TRIM(value) AS tag
-            FROM articles.annotations, json_each('["' || REPLACE(tags, ',', '","') || '"]')
-            WHERE tags IS NOT NULL AND tags != ''
-        )
-        SELECT author_did, COUNT(DISTINCT tag) FROM user_tags GROUP BY author_did
-    `); err != nil {
-        return err
-    }
+        if _, err := tx.ExecContext(ctx, `
+            CREATE TEMP TABLE IF NOT EXISTS _user_tag_counts (user_did TEXT PRIMARY KEY, cnt INT)
+        `); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `DELETE FROM _user_tag_counts`); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `
+            INSERT INTO _user_tag_counts
+            WITH user_tags AS (
+                SELECT author_did, TRIM(value) AS tag
+                FROM articles.annotations, json_each('["' || REPLACE(tags, ',', '","') || '"]')
+                WHERE tags IS NOT NULL AND tags != ''
+            )
+            SELECT author_did, COUNT(DISTINCT tag) FROM user_tags GROUP BY author_did
+        `); err != nil {
+            return err
+        }

-    if _, err := tx.ExecContext(ctx, `
-        CREATE TEMP TABLE IF NOT EXISTS _user_top_categories (user_did TEXT PRIMARY KEY, categories TEXT)
-    `); err != nil {
-        return err
-    }
-    if _, err := tx.ExecContext(ctx, `DELETE FROM _user_top_categories`); err != nil {
-        return err
-    }
-    if _, err := tx.ExecContext(ctx, `
-        INSERT INTO _user_top_categories
-        SELECT user_did, '[' || GROUP_CONCAT('{"c":"' || category || '","n":"' || CAST(cnt AS TEXT) || '}') || ']'
-        FROM (
-            SELECT user_did, category, COUNT(*) AS cnt
-            FROM articles.subscriptions
-            WHERE category IS NOT NULL AND category != ''
-            GROUP BY user_did, category
-            ORDER BY COUNT(*) DESC
-            LIMIT 5
-        )
-        GROUP BY user_did
-    `); err != nil {
-        return err
-    }
+        if _, err := tx.ExecContext(ctx, `
+            CREATE TEMP TABLE IF NOT EXISTS _user_top_categories (user_did TEXT PRIMARY KEY, categories TEXT)
+        `); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `DELETE FROM _user_top_categories`); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `
+            INSERT INTO _user_top_categories
+            SELECT user_did, '[' || GROUP_CONCAT('{"c":"' || category || '","n":"' || CAST(cnt AS TEXT) || '}') || ']'
+            FROM (
+                SELECT user_did, category, COUNT(*) AS cnt
+                FROM articles.subscriptions
+                WHERE category IS NOT NULL AND category != ''
+                GROUP BY user_did, category
+                ORDER BY COUNT(*) DESC
+                LIMIT 5
+            )
+            GROUP BY user_did
+        `); err != nil {
+            return err
+        }

-    _, err = tx.ExecContext(ctx, `
-        INSERT INTO recs.user_signal_profiles (user_did, total_likes, total_tags, top_categories)
-        SELECT
-            u.did,
-            COALESCE(lc.cnt, 0),
-            COALESCE(tc.cnt, 0),
-            COALESCE(cc.categories, '[]')
-        FROM main.users u
-        LEFT JOIN _user_like_counts lc ON lc.user_did = u.did
-        LEFT JOIN _user_tag_counts tc ON tc.user_did = u.did
-        LEFT JOIN _user_top_categories cc ON cc.user_did = u.did
-    `)
-    if err != nil {
-        return err
+        if _, err := tx.ExecContext(ctx, `
+            CREATE TEMP TABLE IF NOT EXISTS _signal_profiles_staging (
+                user_did TEXT PRIMARY KEY, total_likes INT, total_tags INT, top_categories TEXT
+            )
+        `); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `DELETE FROM _signal_profiles_staging`); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `
+            INSERT INTO _signal_profiles_staging (user_did, total_likes, total_tags, top_categories)
+            SELECT
+                u.did,
+                COALESCE(lc.cnt, 0),
+                COALESCE(tc.cnt, 0),
+                COALESCE(cc.categories, '[]')
+            FROM main.users u
+            LEFT JOIN _user_like_counts lc ON lc.user_did = u.did
+            LEFT JOIN _user_tag_counts tc ON tc.user_did = u.did
+            LEFT JOIN _user_top_categories cc ON cc.user_did = u.did
+        `); err != nil {
+            return err
+        }
+
+        if err := tx.Commit(); err != nil {
+            return err
+        }
     }

-    e.logger.Info("signal profiles computed")
-    return tx.Commit()
+    {
+        tx, err := conn.BeginTx(ctx, nil)
+        if err != nil {
+            return err
+        }
+        defer func() { _ = tx.Rollback() }()
+
+        if _, err := tx.ExecContext(ctx, `DELETE FROM recs.user_signal_profiles`); err != nil {
+            return err
+        }
+        if _, err := tx.ExecContext(ctx, `INSERT INTO recs.user_signal_profiles (user_did, total_likes, total_tags, top_categories) SELECT user_did, total_likes, total_tags, top_categories FROM _signal_profiles_staging`); err != nil {
+            return err
+        }
+
+        e.logger.Info("signal profiles computed")
+        return tx.Commit()
+    }
 }

 func (e *Engine) ColdStartRecommendations(ctx context.Context, userDID string, limit int) ([]*FeedRecommendation, error) {
···
         return nil, nil
     }

+    recs, err := e.coldStartFromEmbeddings(ctx, userDID, limit)
+    if err != nil {
+        e.logger.Warn("embedding cold start failed", "error", err)
+    }
+    if len(recs) > 0 {
+        return recs, nil
+    }
+
+    return e.coldStartFromGraphAndPopular(ctx, userDID, limit)
+}
+
+func (e *Engine) coldStartFromGraphAndPopular(ctx context.Context, userDID string, limit int) ([]*FeedRecommendation, error) {
     rows, err := e.db.QueryContext(ctx, `
         WITH followed_feeds AS (
             SELECT s.feed_url, 1.0 AS weight
···
             JOIN articles.subscriptions s ON s.user_did = fd.user_b
             WHERE fd.user_a = ? AND fd.distance = 1
                 AND s.feed_url NOT IN (SELECT feed_url FROM articles.subscriptions WHERE user_did = ?)
-                AND s.feed_url NOT IN (SELECT target_id FROM recs.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed')
+                AND s.feed_url NOT IN (SELECT target_id FROM main.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed')
         ),
         popular_feeds AS (
             SELECT feed_url, subscriber_count,
···
             FROM articles.feeds
             WHERE subscriber_count > 0
                 AND feed_url NOT IN (SELECT feed_url FROM articles.subscriptions WHERE user_did = ?)
-                AND feed_url NOT IN (SELECT target_id FROM recs.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed')
+                AND feed_url NOT IN (SELECT target_id FROM main.dismissed_recommendations WHERE user_did = ? AND target_type = 'feed')
             ORDER BY subscriber_count DESC
             LIMIT 50
         ),
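Note on the three `normalize*Scores` functions added in this file: they apply the same min-max rescaling to different struct types. A minimal sketch of how one generic helper could cover all three, assuming Go 1.18+ generics. The `scored` type and the `minMaxNormalize` name are hypothetical and not part of this change.

```go
package main

import "fmt"

// scored is a hypothetical stand-in for the recommendation structs in the diff.
type scored struct {
	ID    string
	Score float64
}

// minMaxNormalize rescales the value read by get into [0, 1] and writes it
// back through set. Like the functions in the diff, it leaves the input
// unchanged when there are fewer than two items or all values are equal.
func minMaxNormalize[T any](items []T, get func(T) float64, set func(T, float64)) {
	if len(items) < 2 {
		return
	}
	lo, hi := get(items[0]), get(items[0])
	for _, it := range items[1:] {
		v := get(it)
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if hi == lo {
		return
	}
	span := hi - lo
	for _, it := range items {
		set(it, (get(it)-lo)/span)
	}
}

func main() {
	recs := []*scored{{"a", 2}, {"b", 4}, {"c", 3}}
	minMaxNormalize(recs,
		func(r *scored) float64 { return r.Score },
		func(r *scored, v float64) { r.Score = v })
	for _, r := range recs {
		fmt.Println(r.ID, r.Score)
	}
}
```

Whether the extra indirection is worth it for three short functions is a judgment call; the duplicated versions are easier to read in isolation.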
+128 -25
internal/cluster/social.go
···

 import (
     "context"
+    "fmt"
 )

-func (e *Engine) ComputeFollowDistances(ctx context.Context) error {
+const maxFollowDepth = 3
+
+type followDistance struct {
+    userA    string
+    userB    string
+    distance int
+}
+
+func (e *Engine) ComputeFollowDistancesData(ctx context.Context, sources []string) ([]followDistance, error) {
+    if len(sources) == 0 {
+        return nil, nil
+    }
+
+    rows, err := e.db.QueryContext(ctx, `SELECT user_did, target_did FROM main.follows WHERE user_did != target_did`)
+    if err != nil {
+        return nil, err
+    }
+    defer rows.Close()
+
+    adj := make(map[string][]string)
+    for rows.Next() {
+        var src, dst string
+        if err := rows.Scan(&src, &dst); err != nil {
+            return nil, err
+        }
+        adj[src] = append(adj[src], dst)
+    }
+    if err := rows.Err(); err != nil {
+        return nil, err
+    }
+
+    var result []followDistance
+    for _, src := range sources {
+        dist := map[string]int{src: 0}
+        queue := []string{src}
+        for len(queue) > 0 {
+            cur := queue[0]
+            queue = queue[1:]
+            d := dist[cur]
+            if d >= maxFollowDepth {
+                continue
+            }
+            for _, next := range adj[cur] {
+                if _, ok := dist[next]; !ok {
+                    dist[next] = d + 1
+                    queue = append(queue, next)
+                }
+            }
+        }
+        for other, d := range dist {
+            if d > 0 {
+                result = append(result, followDistance{userA: src, userB: other, distance: d})
+            }
+        }
+    }
+
+    return result, nil
+}
+
+func (e *Engine) WriteFollowDistances(ctx context.Context, distances []followDistance) error {
     tx, err := e.db.BeginTx(ctx, nil)
     if err != nil {
         return err
···
         return err
     }

-    _, err = tx.ExecContext(ctx, `
-        INSERT INTO recs.follow_distances (user_a, user_b, distance)
-        SELECT user_a, user_b, MIN(distance) FROM (
-            SELECT user_did AS user_a, target_did AS user_b, 1 AS distance FROM main.follows WHERE user_did != target_did
-            UNION ALL
-            SELECT f1.user_did, f2.target_did, 2
-            FROM main.follows f1
-            JOIN main.follows f2 ON f1.target_did = f2.user_did
-            WHERE f1.user_did != f2.target_did
-        ) GROUP BY user_a, user_b
-    `)
+    stmt, err := tx.PrepareContext(ctx, `INSERT INTO recs.follow_distances (user_a, user_b, distance) VALUES (?, ?, ?)`)
     if err != nil {
         return err
     }
+    defer stmt.Close()

-    e.logger.Info("follow distances computed")
+    for _, d := range distances {
+        if _, err := stmt.ExecContext(ctx, d.userA, d.userB, d.distance); err != nil {
+            return err
+        }
+    }
+
+    e.logger.Info("follow distances computed", "pairs", len(distances))
     return tx.Commit()
 }

-func (e *Engine) ComputeFollowDistancesIncremental(ctx context.Context) error {
-    var maxFollowed string
-    err := e.db.QueryRowContext(ctx, `
-        SELECT COALESCE(MAX(followed_at), '1970-01-01') FROM main.follows
-    `).Scan(&maxFollowed)
+// ComputeFollowDistances incrementally recomputes follow distances for users
+// whose follows changed since the last run, as tracked by the follows_dirty column.
+func (e *Engine) ComputeFollowDistances(ctx context.Context) error {
+    rows, err := e.db.QueryContext(ctx, `SELECT did FROM main.users WHERE follows_dirty = 1`)
     if err != nil {
         return err
     }

-    var lastComputed string
-    err = e.db.QueryRowContext(ctx, `
-        SELECT COALESCE(MAX(computed_at), '1970-01-01') FROM recs.user_similarity
-    `).Scan(&lastComputed)
+    var dirtyUsers []string
+    for rows.Next() {
+        var did string
+        if err := rows.Scan(&did); err != nil {
+            rows.Close()
+            return err
+        }
+        dirtyUsers = append(dirtyUsers, did)
+    }
+    rows.Close()
+
+    if len(dirtyUsers) == 0 {
+        return nil
+    }
+
+    distances, err := e.ComputeFollowDistancesData(ctx, dirtyUsers)
     if err != nil {
         return err
     }

-    if maxFollowed <= lastComputed {
-        return nil
+    tx, err := e.db.BeginTx(ctx, nil)
+    if err != nil {
+        return err
     }
+    defer func() { _ = tx.Rollback() }()

-    return e.ComputeFollowDistances(ctx)
+    ph := make([]string, len(dirtyUsers))
+    args := make([]any, len(dirtyUsers))
+    for i, did := range dirtyUsers {
+        ph[i] = "?"
+        args[i] = did
+    }
+    if _, err := tx.ExecContext(ctx,
+        fmt.Sprintf("DELETE FROM recs.follow_distances WHERE user_a IN (%s)", joinPh(ph)),
+        args...,
+    ); err != nil {
+        return err
+    }
+
+    stmt, err := tx.PrepareContext(ctx, `INSERT INTO recs.follow_distances (user_a, user_b, distance) VALUES (?, ?, ?)`)
+    if err != nil {
+        return err
+    }
+    defer stmt.Close()
+
+    for _, d := range distances {
+        if _, err := stmt.ExecContext(ctx, d.userA, d.userB, d.distance); err != nil {
+            return err
+        }
+    }
+
+    if _, err := tx.ExecContext(ctx,
+        fmt.Sprintf("UPDATE main.users SET follows_dirty = 0 WHERE did IN (%s)", joinPh(ph)),
+        args...,
+    ); err != nil {
+        return err
+    }
+
+    e.logger.Info("follow distances computed", "users", len(dirtyUsers), "pairs", len(distances))
+    return tx.Commit()
 }
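Note on the traversal in `ComputeFollowDistancesData`: it is a breadth-first search from each source that stops expanding nodes once they sit at `maxFollowDepth`. The sketch below isolates that step on an in-memory adjacency map so the depth cut-off is easy to check. The `boundedBFS` name and the sample DIDs are illustrative assumptions, not code from this change.

```go
package main

import "fmt"

// maxDepth mirrors maxFollowDepth in the diff.
const maxDepth = 3

// boundedBFS returns the hop count from src to every account reachable
// within maxDepth follows. The source itself is not included in the result.
func boundedBFS(adj map[string][]string, src string) map[string]int {
	dist := map[string]int{src: 0}
	queue := []string{src}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if dist[cur] >= maxDepth {
			continue // nodes at the depth limit are recorded but not expanded
		}
		for _, next := range adj[cur] {
			if _, seen := dist[next]; !seen {
				dist[next] = dist[cur] + 1
				queue = append(queue, next)
			}
		}
	}
	delete(dist, src)
	return dist
}

func main() {
	// A simple chain: a -> b -> c -> d -> e.
	adj := map[string][]string{
		"did:a": {"did:b"},
		"did:b": {"did:c"},
		"did:c": {"did:d"},
		"did:d": {"did:e"},
	}
	fmt.Println(boundedBFS(adj, "did:a"))
}
```

Because BFS visits nodes in order of distance, the first time an account is seen is also its shortest distance, which is what the removed SQL expressed with `MIN(distance)`.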
+11 -5
internal/cluster/weights.go
···
     minActionsTune = 5
 )

+// RewardSignal increases the weight of the given signal for a user. Only takes
+// effect after minActionsTune positive actions. Signal must be one of: "sub",
+// "like", "tag", "social", "pop", "category", "content".
 func (e *Engine) RewardSignal(ctx context.Context, userDID string, signal string) {
-    return
     e.adjustWeight(ctx, userDID, signal, 1.0)
 }

+// PenalizeSignal decreases the weight of the given signal for a user.
 func (e *Engine) PenalizeSignal(ctx context.Context, userDID string, signal string) {
-    return
     e.adjustWeight(ctx, userDID, signal, -1.0)
 }

 func (e *Engine) adjustWeight(ctx context.Context, userDID string, signal string, delta float64) {
     var actedCount int
     _ = e.db.QueryRowContext(ctx, `
-        SELECT COUNT(*) FROM recs.recommendation_impressions WHERE user_did = ? AND acted = 1
+        SELECT COUNT(*) FROM main.recommendation_impressions WHERE user_did = ? AND acted = 1
     `, userDID).Scan(&actedCount)
     if actedCount < minActionsTune {
         return
···

     if exists == 0 {
         _, _ = e.db.ExecContext(ctx, `
-            INSERT INTO recs.user_signal_weights (user_did, w_sub, w_like, w_tag, w_social, w_pop, w_category)
-            VALUES (?, 1.0, 0.5, 0.3, 0.7, 0.2, 0.4)
+            INSERT INTO recs.user_signal_weights (user_did, w_sub, w_like, w_tag, w_social, w_pop, w_category, w_content)
+            VALUES (?, 1.0, 0.5, 0.3, 0.7, 0.2, 0.4, 0.4)
         `, userDID)
     }

···
         return "w_pop"
     case "category":
         return "w_category"
+    case "content":
+        return "w_content"
     default:
         return ""
     }
 }

+// GetDominantSignal returns the signal name with the highest weight.
 func (e *Engine) GetDominantSignal(w SignalWeights) string {
     signals := map[string]float64{
         "sub":      w.WSub,
···
         "social":   w.WSocial,
         "pop":      w.WPop,
         "category": w.WCategory,
+        "content":  w.WContent,
     }

     var best string
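Note on `GetDominantSignal`: the rest of the function body is outside this hunk, so how it resolves ties is not visible here. Since `category` and `content` now share the default weight 0.4, and Go map iteration order is unspecified, a tie-break may be worth making explicit. A minimal sketch, assuming alphabetical order is an acceptable tie-break. The `dominantSignal` helper is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// dominantSignal returns the name with the highest weight. Names are visited
// in sorted order and only a strictly greater weight replaces the current
// best, so ties resolve to the alphabetically first name.
func dominantSignal(weights map[string]float64) string {
	names := make([]string, 0, len(weights))
	for n := range weights {
		names = append(names, n)
	}
	sort.Strings(names)

	best := ""
	for _, n := range names {
		if best == "" || weights[n] > weights[best] {
			best = n
		}
	}
	return best
}

func main() {
	// Default weights taken from the INSERT statement in this diff.
	defaults := map[string]float64{
		"sub": 1.0, "like": 0.5, "tag": 0.3, "social": 0.7,
		"pop": 0.2, "category": 0.4, "content": 0.4,
	}
	fmt.Println(dominantSignal(defaults))
}
```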
+49 -25
internal/db/db.go
···
 package db

 import (
+    "context"
     "database/sql"
     "fmt"
     "math"
···
     return nil
 }

+func (d *Databases) InitVecTables(dimension int) error {
+    if dimension <= 0 {
+        return nil
+    }
+    for _, stmt := range []string{
+        fmt.Sprintf(`CREATE VIRTUAL TABLE IF NOT EXISTS recs.feed_embeddings USING vec0(feed_url TEXT PRIMARY KEY, embedding float[%d])`, dimension),
+        fmt.Sprintf(`CREATE VIRTUAL TABLE IF NOT EXISTS recs.article_embeddings USING vec0(article_id INTEGER PRIMARY KEY, embedding float[%d])`, dimension),
+    } {
+        if _, err := d.db.ExecContext(context.Background(), stmt); err != nil {
+            return fmt.Errorf("create vec0 table: %w", err)
+        }
+    }
+    return nil
+}
+
 func (d *Databases) DB() *sql.DB {
     return d.db.DB
 }
···
     `CREATE TABLE IF NOT EXISTS users (
         did TEXT PRIMARY KEY,
         indexed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
-        updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
+        updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
+        follows_dirty BOOLEAN NOT NULL DEFAULT 1
     )`,

     `CREATE TABLE IF NOT EXISTS follows (
···
     `CREATE INDEX IF NOT EXISTS idx_follows_target ON follows(target_did)`,
     `CREATE INDEX IF NOT EXISTS idx_follows_uri ON follows(uri)`,
     `CREATE INDEX IF NOT EXISTS idx_follows_followed_at ON follows(followed_at)`,
+
+    `CREATE TABLE IF NOT EXISTS dismissed_recommendations (
+        user_did TEXT NOT NULL,
+        target_type TEXT NOT NULL CHECK(target_type IN ('feed', 'article')),
+        target_id TEXT NOT NULL,
+        reason TEXT,
+        dismissed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
+        PRIMARY KEY (user_did, target_type, target_id)
+    )`,
+
+    `CREATE TABLE IF NOT EXISTS recommendation_impressions (
+        user_did TEXT NOT NULL,
+        target_type TEXT NOT NULL CHECK(target_type IN ('feed', 'article')),
+        target_id TEXT NOT NULL,
+        first_shown_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
+        last_shown_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
+        shown_count INTEGER NOT NULL DEFAULT 1,
+        acted BOOLEAN NOT NULL DEFAULT 0,
+        PRIMARY KEY (user_did, target_type, target_id)
+    )`,
+
+    `CREATE INDEX IF NOT EXISTS idx_dismissed_user_type ON dismissed_recommendations(user_did, target_type)`,
+    `CREATE INDEX IF NOT EXISTS idx_impressions_user_unacted ON recommendation_impressions(user_did, acted, shown_count)`,
+    `CREATE INDEX IF NOT EXISTS idx_impressions_last_shown ON recommendation_impressions(last_shown_at)`,
 }

 var articlesSchema = []string{
···
         CHECK(user_a < user_b)
     )`,

-    `CREATE TABLE IF NOT EXISTS recs.dismissed_recommendations (
-        user_did TEXT NOT NULL,
-        target_type TEXT NOT NULL CHECK(target_type IN ('feed', 'article')),
-        target_id TEXT NOT NULL,
-        reason TEXT,
-        dismissed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
-        PRIMARY KEY (user_did, target_type, target_id)
-    )`,
-
-    `CREATE TABLE IF NOT EXISTS recs.recommendation_impressions (
-        user_did TEXT NOT NULL,
-        target_type TEXT NOT NULL CHECK(target_type IN ('feed', 'article')),
-        target_id TEXT NOT NULL,
-        first_shown_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
-        last_shown_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
-        shown_count INTEGER NOT NULL DEFAULT 1,
-        acted BOOLEAN NOT NULL DEFAULT 0,
-        PRIMARY KEY (user_did, target_type, target_id)
-    )`,
-
     `CREATE TABLE IF NOT EXISTS recs.follow_distances (
         user_a TEXT NOT NULL,
         user_b TEXT NOT NULL,
-        distance INTEGER NOT NULL CHECK(distance IN (1, 2)),
+        distance INTEGER NOT NULL CHECK(distance IN (1, 2, 3)),
         PRIMARY KEY (user_a, user_b)
     )`,

···
         w_social REAL NOT NULL DEFAULT 0.7,
         w_pop REAL NOT NULL DEFAULT 0.2,
         w_category REAL NOT NULL DEFAULT 0.4,
+        w_content REAL NOT NULL DEFAULT 0.4,
         updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
     )`,

···
         updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
     )`,

-    `CREATE INDEX IF NOT EXISTS recs.idx_dismissed_user_type ON dismissed_recommendations(user_did, target_type)`,
-    `CREATE INDEX IF NOT EXISTS recs.idx_impressions_user_unacted ON recommendation_impressions(user_did, acted, shown_count)`,
-    `CREATE INDEX IF NOT EXISTS recs.idx_impressions_last_shown ON recommendation_impressions(last_shown_at)`,
+    `CREATE TABLE IF NOT EXISTS recs.feed_embedding_meta (
+        feed_url TEXT PRIMARY KEY,
+        source_text TEXT NOT NULL DEFAULT ''
+    )`,
+
     `CREATE INDEX IF NOT EXISTS recs.idx_follow_distances_b ON follow_distances(user_b)`,
     `CREATE INDEX IF NOT EXISTS recs.idx_follow_distances_a_dist ON follow_distances(user_a, distance)`,
     `CREATE INDEX IF NOT EXISTS recs.idx_user_similarity_b ON user_similarity(user_b)`,
+22 -1
internal/db/follow.go
··· 22 22 uri = excluded.uri, 23 23 cid = excluded.cid 24 24 `, userDID, targetDID, nilIfEmpty(uri), nilIfEmpty(cid)) 25 + if err != nil { 26 + return err 27 + } 28 + _, err = s.db.ExecContext(ctx, `UPDATE users SET follows_dirty = 1 WHERE did = ?`, userDID) 25 29 return err 26 30 } 27 31 28 32 func (s *UserStore) DeleteFollow(ctx context.Context, userDID, targetDID string) error { 29 33 _, err := s.db.ExecContext(ctx, `DELETE FROM follows WHERE user_did = ? AND target_did = ?`, userDID, targetDID) 34 + if err != nil { 35 + return err 36 + } 37 + _, err = s.db.ExecContext(ctx, `UPDATE users SET follows_dirty = 1 WHERE did = ?`, userDID) 30 38 return err 31 39 } 32 40 33 41 func (s *UserStore) DeleteFollowByURI(ctx context.Context, uri string) error { 34 - _, err := s.db.ExecContext(ctx, `DELETE FROM follows WHERE uri = ?`, uri) 42 + _, err := s.db.ExecContext(ctx, `UPDATE users SET follows_dirty = 1 WHERE did IN (SELECT user_did FROM follows WHERE uri = ?)`, uri) 43 + if err != nil { 44 + return err 45 + } 46 + _, err = s.db.ExecContext(ctx, `DELETE FROM follows WHERE uri = ?`, uri) 35 47 return err 36 48 } 37 49 ··· 139 151 } 140 152 rows.Close() 141 153 154 + var changed bool 142 155 for targetDID := range existing { 143 156 if _, ok := activeFollows[targetDID]; !ok { 144 157 if _, err := tx.ExecContext(ctx, `DELETE FROM follows WHERE user_did = ? AND target_did = ?`, userDID, targetDID); err != nil { 145 158 return err 146 159 } 160 + changed = true 147 161 } 148 162 } 149 163 ··· 163 177 if err != nil { 164 178 return err 165 179 } 180 + changed = true 181 + } 182 + } 183 + 184 + if changed { 185 + if _, err := tx.ExecContext(ctx, `UPDATE users SET follows_dirty = 1 WHERE did = ?`, userDID); err != nil { 186 + return err 166 187 } 167 188 } 168 189
+4
internal/db/include/sqlite3.h
···
+#ifndef SQLITE3_H_BRIDGE
+#define SQLITE3_H_BRIDGE
+#include "sqlite3-binding.h"
+#endif
+11 -10
internal/db/user.go
···
 )

 type User struct {
-    DID         string
-    Handle      string
-    DisplayName string
-    AvatarURL   string
-    IndexedAt   sql.NullTime
-    UpdatedAt   sql.NullTime
+    DID          string
+    Handle       string
+    DisplayName  string
+    AvatarURL    string
+    IndexedAt    sql.NullTime
+    UpdatedAt    sql.NullTime
+    FollowsDirty bool
 }

 type UserStore struct {
···
 func (s *UserStore) GetUser(ctx context.Context, did string) (*User, error) {
     u := &User{}
     err := s.db.QueryRowContext(ctx, `
-        SELECT * FROM users WHERE did = ?
-    `, did).Scan(&u.DID, &u.IndexedAt, &u.UpdatedAt)
+        SELECT did, indexed_at, updated_at, follows_dirty FROM users WHERE did = ?
+    `, did).Scan(&u.DID, &u.IndexedAt, &u.UpdatedAt, &u.FollowsDirty)
     if err != nil {
         return nil, err
     }
···

 func (s *UserStore) ListUsers(ctx context.Context) ([]*User, error) {
     rows, err := s.db.QueryContext(ctx, `
-        SELECT did, indexed_at, updated_at
+        SELECT did, indexed_at, updated_at, follows_dirty
         FROM users ORDER BY updated_at DESC
     `)
     if err != nil {
···
     var users []*User
     for rows.Next() {
         u := &User{}
-        if err := rows.Scan(&u.DID, &u.IndexedAt, &u.UpdatedAt); err != nil {
+        if err := rows.Scan(&u.DID, &u.IndexedAt, &u.UpdatedAt, &u.FollowsDirty); err != nil {
             return nil, err
         }
         users = append(users, u)
+21 -1
main.go
···
     "pkg.rbrt.fr/glean/internal/db"
     "pkg.rbrt.fr/glean/internal/feed"
     "pkg.rbrt.fr/glean/internal/server"
+
+    vec "github.com/asg017/sqlite-vec-go-bindings/cgo"
 )

 func main() {
···

     logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo}))

+    vec.Auto()
     dbs, err := db.OpenAll(*dbPath)
     if err != nil {
         logger.Error("failed to open databases", "error", err)
···
     storeAdapter := db.NewFeedStoreAdapter(dbs.Articles)
     scheduler := feed.NewScheduler(storeAdapter, logger, *fetchInterval, 30*time.Minute)

-    engine := cluster.NewEngine(dbs.DB(), logger)
+    var embedder cluster.Embedder
+    if embedURL := envOr("GLEAN_EMBED_BASE_URL", ""); embedURL != "" {
+        embedder = cluster.NewOpenAIEmbedder(cluster.OpenAIEmbedderConfig{
+            BaseURL:   embedURL,
+            APIKey:    envOr("GLEAN_EMBED_API_KEY", ""),
+            Model:     envOr("GLEAN_EMBED_MODEL", "text-embedding-3-small"),
+            Dimension: envInt("GLEAN_EMBED_DIMENSION", 1536),
+        })
+    }
+
+    if embedder != nil {
+        if err := dbs.InitVecTables(embedder.Dimension()); err != nil {
+            logger.Error("failed to init vec tables", "error", err)
+            os.Exit(1)
+        }
+    }
+
+    engine := cluster.NewEngine(dbs.DB(), embedder, logger)

     srv := server.New(dbs, clientID, callbackURL, *addr, scheduler, engine, logger, []byte(sessionKey))

+18
readme.md
···
 - OPML import and export
 - Sign in with Bluesky / Atmosphere account — no new account needed

+## How recommendations work
+
+Glean looks at what you and other users subscribe to, read, and like to suggest feeds and people you might enjoy.
+
+**Feed suggestions** come from readers who share your subscriptions. If a lot of people who follow the same blogs as you also follow a blog you haven't seen, that blog shows up as a recommendation. The system also considers which articles you've liked, whether you follow the person on Bluesky, and how popular the feed is overall.
+
+**People suggestions** are readers whose subscriptions overlap with yours. The more feeds you share, the higher they rank. You also see whether you have any Bluesky follows in common.
+
+**Dismissals** keep things tidy. If you dismiss a recommendation, it won't come back. If a suggestion sits ignored for more than 5 days, it's automatically removed so newer recommendations can take its place.
+
+**Cold start.** If you're new and have fewer than five subscriptions, Glean shows feeds from people you follow on Bluesky alongside popular feeds from the community, so there's something to explore right away.
+
+The system improves over time: as you subscribe to feeds and like articles, Glean learns which signals matter most to you and adjusts accordingly.
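Note on the overlap ranking described in this readme section: the spec measures shared subscriptions with Jaccard similarity, the size of the intersection divided by the size of the union. A minimal sketch of that measure over two sets of feed URLs. The `jaccard` helper and the example URLs are illustrative assumptions; the project computes this in SQL.

```go
package main

import "fmt"

// jaccard returns |a ∩ b| / |a ∪ b| for two sets of feed URLs.
// Two empty sets are treated as having no overlap.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 0
	}
	inter := 0
	for k := range a {
		if _, ok := b[k]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	return float64(inter) / float64(union)
}

func main() {
	alice := map[string]struct{}{
		"https://example.com/x.xml": {},
		"https://example.com/y.xml": {},
		"https://example.com/z.xml": {},
	}
	bob := map[string]struct{}{
		"https://example.com/y.xml": {},
		"https://example.com/z.xml": {},
		"https://example.com/w.xml": {},
	}
	// Two shared feeds out of four distinct feeds overall.
	fmt.Println(jaccard(alice, bob))
}
```

Dividing by the union keeps a reader with hundreds of subscriptions from looking similar to everyone merely because they subscribe to everything.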
+
 ## Self-hosting

 ### Docker
···
 | `GLEAN_PLC_URL` | `https://didplc.glean.at` | PLC directory URL for DID resolution |
 | `GLEAN_OAUTH_CLIENT_ID` | _(empty)_ | OAuth client metadata URL (leave empty for localhost dev) |
 | `GLEAN_OAUTH_REDIRECT_URL` | _(empty)_ | OAuth redirect URL (leave empty for localhost dev) |
+| `GLEAN_EMBED_BASE_URL` | _(empty)_ | Embeddings API base URL (recommended, see below) |
+| `GLEAN_EMBED_API_KEY` | _(empty)_ | API key for the embeddings endpoint |
+| `GLEAN_EMBED_MODEL` | `text-embedding-3-small` | Embedding model name |
+| `GLEAN_EMBED_DIMENSION` | `1536` | Embedding vector dimension |

 For production:
