palomar: special-case Japanese text indexing using kuromoji (#640)

This is just for posts right now, not profiles (descriptions, display
names, etc.).

I'm somewhat confident in the indexing approach (separate duplicate
fields, gated by text detection). And this seems to work ok for simple
cases.
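
A minimal sketch of the "separate duplicate fields, gated by text detection" idea, for illustration only. The `sketchDoc` type and `buildSketchDoc` helper are hypothetical stand-ins (the real document type is `PostDoc` in `search/transform.go`); the regex uses the same character ranges as `search/japanese.go`:

```go
package main

import (
	"fmt"
	"regexp"
)

// Same ranges as the commit's japaneseRegex: hiragana, katakana,
// and half-width katakana.
var kanaRegex = regexp.MustCompile(`[\x{3040}-\x{30ff}\x{ff66}-\x{ff9f}]`)

// sketchDoc is a hypothetical, trimmed-down stand-in for PostDoc.
type sketchDoc struct {
	Text   string // always populated, analyzed by the existing pipeline
	TextJA string // duplicate of Text, populated only when kana is detected
}

// buildSketchDoc copies the text into the Japanese field only when
// the detection regex matches.
func buildSketchDoc(text string) sketchDoc {
	doc := sketchDoc{Text: text}
	if kanaRegex.MatchString(text) {
		doc.TextJA = text
	}
	return doc
}

func main() {
	for _, s := range []string{"basic english post", "パリ"} {
		d := buildSketchDoc(s)
		fmt.Printf("text=%q text_ja=%q\n", d.Text, d.TextJA)
	}
}
```

The regular field is always filled, so a post with no kana simply has an empty Japanese field and is only reachable through the existing analysis pipeline.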

I'm not very confident about indexing all-kanji text, or text that mixes
Japanese with other non-English character sets. For example, Japanese and
Korean (CJK), or Japanese and Thai (non-CJK).
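
To make those corner cases concrete, here is a small sketch of how a kana-only detector behaves on such inputs. The `hasKana` helper is a hypothetical name; the regex ranges match `search/japanese.go`, and the sample strings are illustrative only:

```go
package main

import (
	"fmt"
	"regexp"
)

// Kana-only detection, same ranges as the commit's japaneseRegex.
var kanaOnly = regexp.MustCompile(`[\x{3040}-\x{30ff}\x{ff66}-\x{ff9f}]`)

// hasKana reports whether the string contains hiragana or katakana.
func hasKana(s string) bool {
	return kanaOnly.MatchString(s)
}

func main() {
	samples := []string{
		"熱力学",         // all kanji: contains no kana
		"한국어",         // Korean hangul only
		"ภาษาไทย パリ",  // Thai plus katakana
		"한국어 がんばる", // hangul plus hiragana
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, hasKana(s))
	}
}
```

All-kanji text is never routed to the Japanese fields by this check, while any mixed-script text that contains even one kana character is routed there in full.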

One positive thing is that everything is still being indexed in the
regular text fields, using the existing analysis pipeline. So we can
revert the query changes if needed, or improve some corner cases using
query-time-only techniques.
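
As a rough sketch of what reverting at query time could look like: the `useJapanese` switch and the `pickField` and `buildQuery` helpers below are assumptions for illustration and are not part of this commit. Only the field names `everything` and `everything_ja` come from the schema change:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Same ranges as the commit's japaneseRegex.
var kanaQuery = regexp.MustCompile(`[\x{3040}-\x{30ff}\x{ff66}-\x{ff9f}]`)

// pickField chooses the field to search. useJapanese is a hypothetical
// switch: turning it off falls back to the regular field for every query.
func pickField(query string, useJapanese bool) string {
	if useJapanese && kanaQuery.MatchString(query) {
		return "everything_ja"
	}
	return "everything"
}

// buildQuery assembles a simplified simple_query_string body.
func buildQuery(query string, useJapanese bool) map[string]interface{} {
	return map[string]interface{}{
		"simple_query_string": map[string]interface{}{
			"query":            query,
			"fields":           []string{pickField(query, useJapanese)},
			"default_operator": "and",
		},
	}
}

func main() {
	for _, on := range []bool{true, false} {
		b, err := json.MarshalIndent(buildQuery("パリ", on), "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(b))
	}
}
```

Because every post is still copied into `everything`, disabling the switch changes only which field is queried; no reindexing would be needed.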

Closes: https://github.com/bluesky-social/indigo/issues/628

authored by bnewbold and committed by GitHub
6162f1e2 0a61bf2d

+431 -73
+9
Makefile
···
  test-interop: ## Run tests, including local interop (requires services running)
  	go clean -testcache && go test -tags=localinterop ./...

+ .PHONY: test-search
+ test-search: ## Run tests, including local search indexing (requires services running)
+ 	go clean -testcache && go test -tags=localsearch ./...
+
  .PHONY: coverage-html
  coverage-html: ## Generate test coverage report and open in browser
  	go test ./... -coverpkg=./... -coverprofile=test-coverage.out
···
  .PHONY: run-postgres
  run-postgres: .env ## Runs a local postgres instance
  	docker compose -f cmd/bigsky/docker-compose.yml up -d
+
+ .PHONY: run-dev-opensearch
+ run-dev-opensearch: .env ## Runs a local opensearch instance
+ 	docker build -f cmd/palomar/Dockerfile.opensearch . -t opensearch-palomar
+ 	docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=0penSearch-Pal0mar" opensearch-palomar

  .PHONY: run-dev-relay
  run-dev-relay: .env ## Runs 'bigsky' Relay for local dev
+2 -1
cmd/palomar/Dockerfile.opensearch
···
- FROM opensearchproject/opensearch:2.5.0
+ FROM opensearchproject/opensearch:2.13.0
  RUN /usr/share/opensearch/bin/opensearch-plugin install --batch analysis-icu
+ RUN /usr/share/opensearch/bin/opensearch-plugin install --batch analysis-kuromoji
+5 -3
cmd/palomar/README.md
···

  Palomar uses environment variables for configuration.

- - `ATP_BGS_HOST`: URL of firehose to subscribe to, either global BGS or individual PDS (default: `wss://bsky.social`)
+ - `ATP_RELAY_HOST`: URL of firehose to subscribe to, either global Relay or individual PDS (default: `wss://bsky.network`)
  - `ATP_PLC_HOST`: PLC directory for identity lookups (default: `https://plc.directory`)
  - `DATABASE_URL`: connection string for database to persist firehose cursor subscription state
  - `PALOMAR_BIND`: IP/port to have HTTP API listen on (default: `:3999`)
···

  ## Development Quickstart

- Run an ephemeral opensearch instance on local port 9200, with SSL disabled, and the `analysis-icu` plugin installed, using docker:
+ Run an ephemeral opensearch instance on local port 9200, with SSL disabled, and the `analysis-icu` and `analysis-kuromoji` plugins installed, using docker:

      docker build -f Dockerfile.opensearch . -t opensearch-palomar
-     docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearch-palomar
+
+     # in any non-development system, obviously change this default password
+     docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" -e OPENSEARCH_INITIAL_ADMIN_PASSWORD=0penSearch-Pal0mar opensearch-palomar

  See [README.opensearch.md]() for more Opensearch operational tips.

+3 -1
cmd/palomar/README.opensearch.md
···

  # Basic OpenSearch Operations

- We use OpenSearch version 2.5+, with the `analysis-icu` plugin. This is included automatically on the AWS hosted version of Opensearch, otherwise you need to install:
+ We use OpenSearch version 2.13+, with the `analysis-icu` and `analysis-kuromoji` plugins. These are included automatically on the AWS hosted version of Opensearch, otherwise you need to install:

      sudo /usr/share/opensearch/bin/opensearch-plugin install analysis-icu
+     sudo /usr/share/opensearch/bin/opensearch-plugin install analysis-kuromoji
      sudo service opensearch restart

  If you are trying to use Elasticsearch 7.10 instead of OpenSearch, you can install the plugin with:

      sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
+     sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-kuromoji
      sudo service elasticsearch restart

  ## Local Development
+10 -10
cmd/palomar/main.go
···
  		&cli.StringFlag{
  			Name:    "elastic-password",
  			Usage:   "elasticsearch password",
- 			Value:   "admin",
+ 			Value:   "0penSearch-Pal0mar",
  			EnvVars: []string{"ES_PASSWORD", "ELASTIC_PASSWORD"},
  		},
  		&cli.StringFlag{
···
  			EnvVars: []string{"ES_PROFILE_INDEX"},
  		},
  		&cli.StringFlag{
- 			Name:    "atp-bgs-host",
- 			Usage:   "hostname and port of BGS to subscribe to",
- 			Value:   "wss://bsky.social",
- 			EnvVars: []string{"ATP_BGS_HOST"},
+ 			Name:    "atp-relay-host",
+ 			Usage:   "hostname and port of Relay to subscribe to",
+ 			Value:   "wss://bsky.network",
+ 			EnvVars: []string{"ATP_RELAY_HOST", "ATP_BGS_HOST"},
  		},
  		&cli.StringFlag{
  			Name:    "atp-plc-host",
···
  			EnvVars: []string{"PALOMAR_METRICS_LISTEN"},
  		},
  		&cli.IntFlag{
- 			Name:    "bgs-sync-rate-limit",
- 			Usage:   "max repo sync (checkout) requests per second to upstream (BGS)",
+ 			Name:    "relay-sync-rate-limit",
+ 			Usage:   "max repo sync (checkout) requests per second to upstream (Relay)",
  			Value:   8,
- 			EnvVars: []string{"PALOMAR_BGS_SYNC_RATE_LIMIT"},
+ 			EnvVars: []string{"PALOMAR_RELAY_SYNC_RATE_LIMIT", "PALOMAR_BGS_SYNC_RATE_LIMIT"},
  		},
  		&cli.IntFlag{
  			Name:    "index-max-concurrency",
···
  		escli,
  		&dir,
  		search.Config{
- 			BGSHost:             cctx.String("atp-bgs-host"),
+ 			RelayHost:           cctx.String("atp-relay-host"),
  			ProfileIndex:        cctx.String("es-profile-index"),
  			PostIndex:           cctx.String("es-post-index"),
  			Logger:              logger,
- 			BGSSyncRateLimit:    cctx.Int("bgs-sync-rate-limit"),
+ 			RelaySyncRateLimit:  cctx.Int("relay-sync-rate-limit"),
  			IndexMaxConcurrency: cctx.Int("index-max-concurrency"),
  		},
  	)
+5 -5
search/firehose.go
···
  	}

  	d := websocket.DefaultDialer
- 	u, err := url.Parse(s.bgshost)
+ 	u, err := url.Parse(s.relayHost)
  	if err != nil {
- 		return fmt.Errorf("invalid bgshost URI: %w", err)
+ 		return fmt.Errorf("invalid relayHost URI: %w", err)
  	}
  	u.Path = "xrpc/com.atproto.sync.subscribeRepos"
  	if cur != 0 {
···
  	return events.HandleRepoStream(
  		ctx, con, autoscaling.NewScheduler(
  			autoscaling.DefaultAutoscaleSettings(),
- 			s.bgshost,
+ 			s.relayHost,
  			rsc.EventHandler,
  		),
  	)
···
  	totalErrored := 0

  	for {
- 		resp, err := comatproto.SyncListRepos(ctx, s.bgsxrpc, cursor, limit)
+ 		resp, err := comatproto.SyncListRepos(ctx, s.relayClient, cursor, limit)
  		if err != nil {
  			log.Error("failed to list repos", "err", err)
  			time.Sleep(5 * time.Second)
···
  }

  func (s *Server) processTooBigCommit(ctx context.Context, evt *comatproto.SyncSubscribeRepos_Commit) error {
- 	repodata, err := comatproto.SyncGetRepo(ctx, s.bgsxrpc, evt.Repo, "")
+ 	repodata, err := comatproto.SyncGetRepo(ctx, s.relayClient, evt.Repo, "")
  	if err != nil {
  		return err
  	}
+14
search/japanese.go
···
+ package search
+
+ import (
+ 	"regexp"
+ )
+
+ // U+3040 - U+30FF: hiragana and katakana (Japanese only)
+ // U+FF66 - U+FF9F: half-width katakana (Japanese only)
+ var japaneseRegex = regexp.MustCompile(`[\x{3040}-\x{30ff}\x{ff66}-\x{ff9f}]`)
+
+ // helper to check if an input string contains any Japanese-specific characters (hiragana or katakana). will not trigger on CJK characters which are not specific to Japanese
+ func containsJapanese(text string) bool {
+ 	return japaneseRegex.MatchString(text)
+ }
+23
search/japanese_test.go
···
+ package search
+
+ import (
+ 	"testing"
+
+ 	"github.com/stretchr/testify/assert"
+ )
+
+ func TestJapaneseDetection(t *testing.T) {
+ 	assert := assert.New(t)
+
+ 	assert.False(containsJapanese(""))
+ 	assert.False(containsJapanese("basic english"))
+ 	assert.False(containsJapanese("basic english"))
+
+ 	assert.True(containsJapanese("学校から帰って熱いお風呂に入ったら力一杯がんばる"))
+ 	assert.True(containsJapanese("パリ"))
+ 	assert.True(containsJapanese("ハリー・ポッター"))
+ 	assert.True(containsJapanese("some japanese パリ and some english"))
+
+ 	// CJK, but not japanese-specific
+ 	assert.False(containsJapanese("熱力学"))
+ }
+29
search/post_schema.json
···
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "icu_normalizer" ],
            "filter": [ "icu_folding" ]
+         },
+         "textJapanese": {
+           "type": "custom",
+           "tokenizer": "kuromoji_tokenizer",
+           "char_filter": [ "icu_normalizer" ],
+           "filter": [
+             "kuromoji_baseform",
+             "kuromoji_part_of_speech",
+             "cjk_width",
+             "ja_stop",
+             "kuromoji_stemmer",
+             "lowercase"
+           ]
+         },
+         "textJapaneseSearch": {
+           "type": "custom",
+           "tokenizer": "kuromoji_tokenizer",
+           "char_filter": [ "icu_normalizer" ],
+           "filter": [
+             "kuromoji_baseform",
+             "kuromoji_part_of_speech",
+             "cjk_width",
+             "ja_stop",
+             "kuromoji_stemmer",
+             "lowercase"
+           ]
          }
        },
        "normalizer": {
···

      "created_at": { "type": "date" },
      "text": { "type": "text", "analyzer": "textIcu", "search_analyzer": "textIcuSearch", "copy_to": "everything" },
+     "text_ja": { "type": "text", "analyzer": "textJapanese", "search_analyzer": "textJapaneseSearch", "copy_to": "everything_ja" },
      "lang_code": { "type": "keyword", "normalizer": "default" },
      "lang_code_iso2": { "type": "keyword", "normalizer": "default" },
      "mention_did": { "type": "keyword", "normalizer": "default" },
···
      "reply_root_aturi": { "type": "keyword", "normalizer": "default" },
      "embed_img_count": { "type": "integer" },
      "embed_img_alt_text": { "type": "text", "analyzer": "textIcu", "search_analyzer": "textIcuSearch", "copy_to": "everything" },
+     "embed_img_alt_text_ja": { "type": "text", "analyzer": "textJapanese", "search_analyzer": "textJapaneseSearch", "copy_to": "everything_ja" },
      "self_label": { "type": "keyword", "normalizer": "default" },

      "tag": { "type": "keyword", "normalizer": "default" },
      "emoji": { "type": "keyword", "normalizer": "caseSensitive" },

      "everything": { "type": "text", "analyzer": "textIcu", "search_analyzer": "textIcuSearch" },
+     "everything_ja": { "type": "text", "analyzer": "textJapanese", "search_analyzer": "textJapaneseSearch" },

      "lang": { "type": "alias", "path": "lang_code_iso2" }
    }
+5 -1
search/query.go
···
  		return nil, err
  	}
  	queryStr, filters := ParseQuery(ctx, dir, q)
+ 	idx := "everything"
+ 	if containsJapanese(queryStr) {
+ 		idx = "everything_ja"
+ 	}
  	basic := map[string]interface{}{
  		"simple_query_string": map[string]interface{}{
  			"query":            queryStr,
- 			"fields":           []string{"everything"},
+ 			"fields":           []string{idx},
  			"flags":            "AND|NOT|OR|PHRASE|PRECEDENCE|WHITESPACE",
  			"default_operator": "and",
  			"lenient":          true,
+202
search/query_test.go
··· 1 + //go:build localsearch 2 + 3 + package search 4 + 5 + import ( 6 + "context" 7 + "crypto/tls" 8 + "io" 9 + "log/slog" 10 + "net/http" 11 + "testing" 12 + 13 + appbsky "github.com/bluesky-social/indigo/api/bsky" 14 + "github.com/bluesky-social/indigo/atproto/identity" 15 + "github.com/bluesky-social/indigo/atproto/syntax" 16 + 17 + "github.com/ipfs/go-cid" 18 + es "github.com/opensearch-project/opensearch-go/v2" 19 + "github.com/stretchr/testify/assert" 20 + "gorm.io/driver/sqlite" 21 + "gorm.io/gorm" 22 + ) 23 + 24 + var ( 25 + testPostIndex = "palomar_test_post" 26 + testProfileIndex = "palomar_test_profile" 27 + ) 28 + 29 + func testEsClient(t *testing.T) *es.Client { 30 + cfg := es.Config{ 31 + Addresses: []string{"http://localhost:9200"}, 32 + Username: "admin", 33 + Password: "0penSearch-Pal0mar", 34 + CACert: nil, 35 + Transport: &http.Transport{ 36 + MaxIdleConnsPerHost: 5, 37 + TLSClientConfig: &tls.Config{ 38 + InsecureSkipVerify: true, 39 + }, 40 + }, 41 + } 42 + escli, err := es.NewClient(cfg) 43 + if err != nil { 44 + t.Fatal(err) 45 + } 46 + info, err := escli.Info() 47 + if err != nil { 48 + t.Fatal(err) 49 + } 50 + info.Body.Close() 51 + return escli 52 + 53 + } 54 + 55 + func testServer(ctx context.Context, t *testing.T, escli *es.Client, dir identity.Directory) *Server { 56 + db, err := gorm.Open(sqlite.Open("file::memory:?cache=shared"), &gorm.Config{}) 57 + if err != nil { 58 + t.Fatal(err) 59 + } 60 + 61 + srv, err := NewServer( 62 + db, 63 + escli, 64 + dir, 65 + Config{ 66 + RelayHost: "wss://relay.invalid", 67 + PostIndex: testPostIndex, 68 + ProfileIndex: testProfileIndex, 69 + Logger: slog.Default(), 70 + RelaySyncRateLimit: 1, 71 + IndexMaxConcurrency: 1, 72 + }, 73 + ) 74 + if err != nil { 75 + t.Fatal(err) 76 + } 77 + 78 + // NOTE: skipping errors 79 + resp, _ := srv.escli.Indices.Delete([]string{testPostIndex, testProfileIndex}) 80 + defer resp.Body.Close() 81 + io.ReadAll(resp.Body) 82 + 83 + if err := srv.EnsureIndices(ctx); 
err != nil { 84 + t.Fatal(err) 85 + } 86 + 87 + return srv 88 + } 89 + 90 + func TestJapaneseRegressions(t *testing.T) { 91 + assert := assert.New(t) 92 + ctx := context.Background() 93 + escli := testEsClient(t) 94 + dir := identity.NewMockDirectory() 95 + srv := testServer(ctx, t, escli, &dir) 96 + ident := identity.Identity{ 97 + DID: syntax.DID("did:plc:abc111"), 98 + Handle: syntax.Handle("handle.example.com"), 99 + } 100 + 101 + res, err := DoSearchPosts(ctx, &dir, escli, testPostIndex, "english", 0, 20) 102 + if err != nil { 103 + t.Fatal(err) 104 + } 105 + assert.Equal(0, len(res.Hits.Hits)) 106 + 107 + p1 := appbsky.FeedPost{Text: "basic english post", CreatedAt: "2024-01-02T03:04:05.006Z"} 108 + assert.NoError(srv.indexPost(ctx, &ident, &p1, "app.bsky.feed.post/3kpnillluoh2y", cid.Undef)) 109 + 110 + // https://github.com/bluesky-social/indigo/issues/302 111 + p2 := appbsky.FeedPost{Text: "学校から帰って熱いお風呂に入ったら力一杯がんばる", CreatedAt: "2024-01-02T03:04:05.006Z"} 112 + assert.NoError(srv.indexPost(ctx, &ident, &p2, "app.bsky.feed.post/3kpnillluo222", cid.Undef)) 113 + p3 := appbsky.FeedPost{Text: "熱力学", CreatedAt: "2024-01-02T03:04:05.006Z"} 114 + assert.NoError(srv.indexPost(ctx, &ident, &p3, "app.bsky.feed.post/3kpnillluo333", cid.Undef)) 115 + p4 := appbsky.FeedPost{Text: "東京都", CreatedAt: "2024-01-02T03:04:05.006Z"} 116 + assert.NoError(srv.indexPost(ctx, &ident, &p4, "app.bsky.feed.post/3kpnillluo444", cid.Undef)) 117 + p5 := appbsky.FeedPost{Text: "京都", CreatedAt: "2024-01-02T03:04:05.006Z"} 118 + assert.NoError(srv.indexPost(ctx, &ident, &p5, "app.bsky.feed.post/3kpnillluo555", cid.Undef)) 119 + p6 := appbsky.FeedPost{Text: "パリ", CreatedAt: "2024-01-02T03:04:05.006Z"} 120 + assert.NoError(srv.indexPost(ctx, &ident, &p6, "app.bsky.feed.post/3kpnillluo666", cid.Undef)) 121 + p7 := appbsky.FeedPost{Text: "ハリー・ポッター", CreatedAt: "2024-01-02T03:04:05.006Z"} 122 + assert.NoError(srv.indexPost(ctx, &ident, &p7, "app.bsky.feed.post/3kpnillluo777", cid.Undef)) 123 + 
p8 := appbsky.FeedPost{Text: "ハリ", CreatedAt: "2024-01-02T03:04:05.006Z"} 124 + assert.NoError(srv.indexPost(ctx, &ident, &p8, "app.bsky.feed.post/3kpnillluo223", cid.Undef)) 125 + p9 := appbsky.FeedPost{Text: "multilingual 多言語", CreatedAt: "2024-01-02T03:04:05.006Z"} 126 + assert.NoError(srv.indexPost(ctx, &ident, &p9, "app.bsky.feed.post/3kpnillluo224", cid.Undef)) 127 + 128 + _, err = srv.escli.Indices.Refresh() 129 + assert.NoError(err) 130 + 131 + // expect all to be indexed 132 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "*", 0, 20) 133 + if err != nil { 134 + t.Fatal(err) 135 + } 136 + assert.Equal(9, len(res.Hits.Hits)) 137 + 138 + // check that english matches (single post) 139 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "english", 0, 20) 140 + if err != nil { 141 + t.Fatal(err) 142 + } 143 + assert.Equal(1, len(res.Hits.Hits)) 144 + 145 + // "thermodynamics"; should return only one match 146 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "熱力学", 0, 20) 147 + if err != nil { 148 + t.Fatal(err) 149 + } 150 + assert.Equal(1, len(res.Hits.Hits)) 151 + 152 + // "Kyoto"; should return only one match 153 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "京都", 0, 20) 154 + if err != nil { 155 + t.Fatal(err) 156 + } 157 + assert.Equal(1, len(res.Hits.Hits)) 158 + 159 + // "Paris"; should return only one match 160 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "パリ", 0, 20) 161 + if err != nil { 162 + t.Fatal(err) 163 + } 164 + assert.Equal(1, len(res.Hits.Hits)) 165 + 166 + // should return only one match 167 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "ハリー", 0, 20) 168 + if err != nil { 169 + t.Fatal(err) 170 + } 171 + assert.Equal(1, len(res.Hits.Hits)) 172 + 173 + // part of a word; should match none 174 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "ハ", 0, 20) 175 + if err != nil { 176 + t.Fatal(err) 177 + } 178 + assert.Equal(0, len(res.Hits.Hits)) 179 + 180 + // 
should match both ways, and together 181 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "multilingual", 0, 20) 182 + if err != nil { 183 + t.Fatal(err) 184 + } 185 + assert.Equal(1, len(res.Hits.Hits)) 186 + 187 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "多言語", 0, 20) 188 + if err != nil { 189 + t.Fatal(err) 190 + } 191 + assert.Equal(1, len(res.Hits.Hits)) 192 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "multilingual 多言語", 0, 20) 193 + if err != nil { 194 + t.Fatal(err) 195 + } 196 + assert.Equal(1, len(res.Hits.Hits)) 197 + res, err = DoSearchPosts(ctx, &dir, escli, testPostIndex, "\"multilingual 多言語\"", 0, 20) 198 + if err != nil { 199 + t.Fatal(err) 200 + } 201 + assert.Equal(1, len(res.Hits.Hits)) 202 + }
+21 -17
search/server.go
··· 29 29 postIndex string 30 30 profileIndex string 31 31 db *gorm.DB 32 - bgshost string 33 - bgsxrpc *xrpc.Client 32 + relayHost string 33 + relayClient *xrpc.Client 34 34 dir identity.Directory 35 35 echo *echo.Echo 36 36 logger *slog.Logger ··· 47 47 } 48 48 49 49 type Config struct { 50 - BGSHost string 50 + RelayHost string 51 51 ProfileIndex string 52 52 PostIndex string 53 53 Logger *slog.Logger 54 - BGSSyncRateLimit int 54 + RelaySyncRateLimit int 55 55 IndexMaxConcurrency int 56 56 DiscoverRepos bool 57 57 } ··· 68 68 db.AutoMigrate(&LastSeq{}) 69 69 db.AutoMigrate(&backfill.GormDBJob{}) 70 70 71 - bgsws := config.BGSHost 72 - if !strings.HasPrefix(bgsws, "ws") { 73 - return nil, fmt.Errorf("specified bgs host must include 'ws://' or 'wss://'") 71 + relayws := config.RelayHost 72 + if !strings.HasPrefix(relayws, "ws") { 73 + return nil, fmt.Errorf("specified relay host must include 'ws://' or 'wss://'") 74 74 } 75 75 76 - bgshttp := strings.Replace(bgsws, "ws", "http", 1) 77 - bgsxrpc := &xrpc.Client{ 78 - Host: bgshttp, 76 + relayhttp := strings.Replace(relayws, "ws", "http", 1) 77 + relayClient := &xrpc.Client{ 78 + Host: relayhttp, 79 79 } 80 80 81 81 s := &Server{ ··· 83 83 profileIndex: config.ProfileIndex, 84 84 postIndex: config.PostIndex, 85 85 db: db, 86 - bgshost: config.BGSHost, // NOTE: the original URL, not 'bgshttp' 87 - bgsxrpc: bgsxrpc, 86 + relayHost: config.RelayHost, // NOTE: the original URL, not 'relayhttp' 87 + relayClient: relayClient, 88 88 dir: dir, 89 89 logger: logger, 90 90 enableRepoDiscovery: config.DiscoverRepos, ··· 92 92 93 93 bfstore := backfill.NewGormstore(db) 94 94 opts := backfill.DefaultBackfillOptions() 95 - if config.BGSSyncRateLimit > 0 { 96 - opts.SyncRequestsPerSecond = config.BGSSyncRateLimit 97 - opts.ParallelBackfills = 2 * config.BGSSyncRateLimit 95 + if config.RelaySyncRateLimit > 0 { 96 + opts.SyncRequestsPerSecond = config.RelaySyncRateLimit 97 + opts.ParallelBackfills = 2 * config.RelaySyncRateLimit 98 
98 } else { 99 99 opts.SyncRequestsPerSecond = 8 100 100 } 101 - opts.CheckoutPath = fmt.Sprintf("%s/xrpc/com.atproto.sync.getRepo", bgshttp) 101 + opts.CheckoutPath = fmt.Sprintf("%s/xrpc/com.atproto.sync.getRepo", relayhttp) 102 102 if config.IndexMaxConcurrency > 0 { 103 103 opts.ParallelRecordCreates = config.IndexMaxConcurrency 104 104 } else { ··· 158 158 return err 159 159 } 160 160 defer resp.Body.Close() 161 - io.ReadAll(resp.Body) 161 + errBytes, err := io.ReadAll(resp.Body) 162 162 if resp.IsError() { 163 + s.logger.Error("failed to create index", "index", idx.Name, "response", string(errBytes)) 163 164 return fmt.Errorf("failed to create index") 165 + } 166 + if err != nil { 167 + return err 164 168 } 165 169 } 166 170 }
+57
search/testdata/transform-post-fixtures.json
··· 186 186 ], 187 187 "embed_img_count": 2 188 188 } 189 + }, 190 + { 191 + "did": "did:plc:u5cwb2mwiv2bfq53cjufe6yn", 192 + "handle": "handle.example.com", 193 + "rkey": "3k4duaz5vfs2b", 194 + "cid": "bafyreibjifzpqj6o6wcq3hejh7y4z4z2vmiklkvykc57tw3pcbx3kxifpm", 195 + "PostRecord": { 196 + "$type": "app.bsky.feed.post", 197 + "text": "学校から帰って熱いお風呂に入ったら力一杯がんばる", 198 + "createdAt": "2023-08-07T05:46:14.423045Z", 199 + "embed": { 200 + "$type": "app.bsky.embed.images", 201 + "images": [ 202 + { 203 + "alt": "brief alt text description of the first image ハリー・ポッター", 204 + "image": { 205 + "$type": "blob", 206 + "ref": { 207 + "$link": "bafkreibabalobzn6cd366ukcsjycp4yymjymgfxcv6xczmlgpemzkz3cfa" 208 + }, 209 + "mimeType": "image/webp", 210 + "size": 760898 211 + } 212 + }, 213 + { 214 + "alt": "brief alt text description of the second image", 215 + "image": { 216 + "$type": "blob", 217 + "ref": { 218 + "$link": "bafkreif3fouono2i3fmm5moqypwskh3yjtp7snd5hfq5pr453oggygyrte" 219 + }, 220 + "mimeType": "image/png", 221 + "size": 13208 222 + } 223 + } 224 + ] 225 + } 226 + }, 227 + "doc_id": "did:plc:u5cwb2mwiv2bfq53cjufe6yn_3k4duaz5vfs2b", 228 + "PostDoc": { 229 + "doc_index_ts": "2006-01-02T15:04:05.000Z", 230 + "did": "did:plc:u5cwb2mwiv2bfq53cjufe6yn", 231 + "handle": "handle.example.com", 232 + "record_rkey": "3k4duaz5vfs2b", 233 + "record_cid": "bafyreibjifzpqj6o6wcq3hejh7y4z4z2vmiklkvykc57tw3pcbx3kxifpm", 234 + "created_at": "2023-08-07T05:46:14.423045Z", 235 + "text": "学校から帰って熱いお風呂に入ったら力一杯がんばる", 236 + "text_ja": "学校から帰って熱いお風呂に入ったら力一杯がんばる", 237 + "embed_img_alt_text": [ 238 + "brief alt text description of the first image ハリー・ポッター", 239 + "brief alt text description of the second image" 240 + ], 241 + "embed_img_alt_text_ja": [ 242 + "brief alt text description of the first image ハリー・ポッター" 243 + ], 244 + "embed_img_count": 2 245 + } 189 246 } 190 247 ]
+46 -35
search/transform.go
··· 28 28 } 29 29 30 30 type PostDoc struct { 31 - DocIndexTs string `json:"doc_index_ts"` 32 - DID string `json:"did"` 33 - RecordRkey string `json:"record_rkey"` 34 - RecordCID string `json:"record_cid"` 35 - CreatedAt *string `json:"created_at,omitempty"` 36 - Text string `json:"text"` 37 - LangCode []string `json:"lang_code,omitempty"` 38 - LangCodeIso2 []string `json:"lang_code_iso2,omitempty"` 39 - MentionDID []string `json:"mention_did,omitempty"` 40 - LinkURL []string `json:"link_url,omitempty"` 41 - EmbedURL *string `json:"embed_url,omitempty"` 42 - EmbedATURI *string `json:"embed_aturi,omitempty"` 43 - ReplyRootATURI *string `json:"reply_root_aturi,omitempty"` 44 - EmbedImgCount int `json:"embed_img_count"` 45 - EmbedImgAltText []string `json:"embed_img_alt_text,omitempty"` 46 - SelfLabel []string `json:"self_label,omitempty"` 47 - Tag []string `json:"tag,omitempty"` 48 - Emoji []string `json:"emoji,omitempty"` 31 + DocIndexTs string `json:"doc_index_ts"` 32 + DID string `json:"did"` 33 + RecordRkey string `json:"record_rkey"` 34 + RecordCID string `json:"record_cid"` 35 + CreatedAt *string `json:"created_at,omitempty"` 36 + Text string `json:"text"` 37 + TextJA string `json:"text_ja,omitempty"` 38 + LangCode []string `json:"lang_code,omitempty"` 39 + LangCodeIso2 []string `json:"lang_code_iso2,omitempty"` 40 + MentionDID []string `json:"mention_did,omitempty"` 41 + LinkURL []string `json:"link_url,omitempty"` 42 + EmbedURL *string `json:"embed_url,omitempty"` 43 + EmbedATURI *string `json:"embed_aturi,omitempty"` 44 + ReplyRootATURI *string `json:"reply_root_aturi,omitempty"` 45 + EmbedImgCount int `json:"embed_img_count"` 46 + EmbedImgAltText []string `json:"embed_img_alt_text,omitempty"` 47 + EmbedImgAltTextJA []string `json:"embed_img_alt_text_ja,omitempty"` 48 + SelfLabel []string `json:"self_label,omitempty"` 49 + Tag []string `json:"tag,omitempty"` 50 + Emoji []string `json:"emoji,omitempty"` 49 51 } 50 52 51 53 // Returns the search index document 
ID (`_id`) for this document. ··· 143 145 } 144 146 var embedImgCount int = 0 145 147 var embedImgAltText []string 148 + var embedImgAltTextJA []string 146 149 if post.Embed != nil && post.Embed.EmbedImages != nil { 147 150 embedImgCount = len(post.Embed.EmbedImages.Images) 148 151 for _, img := range post.Embed.EmbedImages.Images { 149 152 if img.Alt != "" { 150 153 embedImgAltText = append(embedImgAltText, img.Alt) 154 + if containsJapanese(img.Alt) { 155 + embedImgAltTextJA = append(embedImgAltTextJA, img.Alt) 156 + } 151 157 } 152 158 } 153 159 } ··· 159 165 } 160 166 161 167 doc := PostDoc{ 162 - DocIndexTs: syntax.DatetimeNow().String(), 163 - DID: ident.DID.String(), 164 - RecordRkey: rkey, 165 - RecordCID: cid, 166 - Text: post.Text, 167 - LangCode: post.Langs, 168 - LangCodeIso2: langCodeIso2, 169 - MentionDID: mentionDIDs, 170 - LinkURL: linkURLs, 171 - EmbedURL: embedURL, 172 - EmbedATURI: embedATURI, 173 - ReplyRootATURI: replyRootATURI, 174 - EmbedImgCount: embedImgCount, 175 - EmbedImgAltText: embedImgAltText, 176 - SelfLabel: selfLabels, 177 - Tag: parsePostTags(post), 178 - Emoji: parseEmojis(post.Text), 168 + DocIndexTs: syntax.DatetimeNow().String(), 169 + DID: ident.DID.String(), 170 + RecordRkey: rkey, 171 + RecordCID: cid, 172 + Text: post.Text, 173 + LangCode: post.Langs, 174 + LangCodeIso2: langCodeIso2, 175 + MentionDID: mentionDIDs, 176 + LinkURL: linkURLs, 177 + EmbedURL: embedURL, 178 + EmbedATURI: embedATURI, 179 + ReplyRootATURI: replyRootATURI, 180 + EmbedImgCount: embedImgCount, 181 + EmbedImgAltText: embedImgAltText, 182 + EmbedImgAltTextJA: embedImgAltTextJA, 183 + SelfLabel: selfLabels, 184 + Tag: parsePostTags(post), 185 + Emoji: parseEmojis(post.Text), 186 + } 187 + 188 + if containsJapanese(post.Text) { 189 + doc.TextJA = post.Text 179 190 } 180 191 181 192 if post.CreatedAt != "" {