Improve Ferret engine indexing performance for large blocks of text

@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge

Summary:
See PHI87. Ref T12974. Currently, we do a lot more work here than we need to: we call `phutil_utf8_strtolower()` on each token, but can do it once at the beginning on the whole block.

Additionally, since ngrams don't care about order, we only need to convert unique tokens into ngrams. This saves us some `phutil_utf8v()` calls, which can be slow for large inputs.

Test Plan:
- Created a ~4MB task description.
- Ran `bin/search index Txxx --profile ...` to profile indexing performance before and after the change.
- Saw total runtime drop from 38s to 9s.
- Before: <https://secure.phabricator.com/xhprof/profile/PHID-FILE-wiht5d7lkyazaywwxovw/>
- After: <https://secure.phabricator.com/xhprof/profile/PHID-FILE-efxv56q2hulr6kjrxbx6/>
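A large, highly repetitive input for this kind of test could be produced along the following lines. This is only a sketch: the source does not say what the ~4MB description contained, and the function name and vocabulary here are hypothetical.

```php
<?php

// Hypothetical helper: build at least $target_bytes of text by repeating a
// small vocabulary, so that almost every token is a duplicate.
function sketch_build_redundant_text(int $target_bytes): string {
  // Assumed vocabulary; any short list of words would do.
  $vocabulary = array('ferret', 'engine', 'ngram', 'index', 'token');
  $line = implode(' ', $vocabulary)."\n";

  // Round up so the result is never shorter than the target size.
  $repeats = (int)ceil($target_bytes / strlen($line));
  return str_repeat($line, $repeats);
}

// Roughly 4MB of text that could be pasted or uploaded as a description.
$text = sketch_build_redundant_text(4 * 1024 * 1024);
```

Text like this is close to the best case for the change, because only five distinct tokens ever need to be split into characters.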

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T12974

Differential Revision: https://secure.phabricator.com/D18647

epriestley 8 years ago a1d9a238 9f11f310

+10 -3

1 changed file

src/applications/search/ferret/PhabricatorFerretEngine.php

@@ -88,16 +88,23 @@
   }
 
   private function getNgramsFromString($value, $as_term) {
+    $value = phutil_utf8_strtolower($value);
     $tokens = $this->tokenizeString($value);
 
-    $ngrams = array();
+    // First, extract unique tokens from the string. This reduces the number
+    // of `phutil_utf8v()` calls we need to make if we are indexing a large
+    // corpus with redundant terms.
+    $unique_tokens = array();
     foreach ($tokens as $token) {
-      $token = phutil_utf8_strtolower($token);
-
       if ($as_term) {
         $token = ' '.$token.' ';
       }
 
+      $unique_tokens[$token] = true;
+    }
+
+    $ngrams = array();
+    foreach ($unique_tokens as $token => $ignored) {
       $token_v = phutil_utf8v($token);
       $len = (count($token_v) - 2);
       for ($ii = 0; $ii < $len; $ii++) {
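The diff is cut off inside the inner loop, so here is a minimal standalone sketch of the same flow: lowercase once, tokenize, deduplicate tokens, then build ngrams. It is not the Phorge code. The function name is hypothetical, whitespace splitting stands in for `tokenizeString()` (which is not shown), `mb_strtolower()` and `preg_split('//u', ...)` stand in for `phutil_utf8_strtolower()` and `phutil_utf8v()`, and the trigram assembly is inferred from the `count($token_v) - 2` bound.

```php
<?php

// Hypothetical standalone version of the deduplicating ngram extraction.
function sketch_ngrams_from_string(string $value, bool $as_term): array {
  // Lowercase the whole block once instead of once per token. Falls back to
  // ASCII-only strtolower() if the mbstring extension is unavailable.
  $value = function_exists('mb_strtolower')
    ? mb_strtolower($value, 'UTF-8')
    : strtolower($value);

  // Assumption: tokens are whitespace-separated.
  $tokens = preg_split('/\s+/u', $value, -1, PREG_SPLIT_NO_EMPTY);

  // Collect each distinct token once; array keys deduplicate for us.
  $unique_tokens = array();
  foreach ($tokens as $token) {
    if ($as_term) {
      $token = ' '.$token.' ';
    }
    $unique_tokens[$token] = true;
  }

  $ngrams = array();
  foreach ($unique_tokens as $token => $ignored) {
    // PHP turns numeric string keys such as "42" into integers; cast back.
    $token = (string)$token;

    // Split into UTF-8 characters, the expensive step we want to run less.
    $token_v = preg_split('//u', $token, -1, PREG_SPLIT_NO_EMPTY);
    $len = count($token_v) - 2;
    for ($ii = 0; $ii < $len; $ii++) {
      $ngram = $token_v[$ii].$token_v[$ii + 1].$token_v[$ii + 2];
      $ngrams[$ngram] = true;
    }
  }

  // Keys may have been cast to integers (for example "123"), so normalize.
  return array_map('strval', array_keys($ngrams));
}
```

Calling `sketch_ngrams_from_string('Foo foo BAR', true)` splits only two padded tokens (`' foo '` and `' bar '`) into characters rather than three, and yields six distinct trigrams. On a multi-megabyte input with a small vocabulary, the number of character splits is bounded by the vocabulary size rather than by the input length.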
