@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge

Improve performance of Ferret engine ngram extraction, particularly for large input strings

Summary:
See PHI87. Ref T12974. Splitting the string apart with `array_slice()` can perform poorly for large input strings. I think this is mostly the cost of the large number of calls, plus building and returning an array each time, which is not entirely trivial.

We can use `substr()` instead, as long as we're careful to keep track of the byte offset we're slicing from when the string contains multibyte UTF-8 characters.
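
The `substr()` approach described above can be sketched as a standalone script. This is only an illustration under stated assumptions, not the code in `PhabricatorFerretEngine`: `sketch_utf8v()` and `sketch_trigrams()` are hypothetical names, `preg_split()` with the `u` modifier stands in for `phutil_utf8v()`, and the input is assumed to be valid UTF-8 that the caller has already padded with spaces.

```php
<?php

// Hypothetical stand-in for phutil_utf8v(): split a valid UTF-8 string into
// an array of characters, each of which may be several bytes long.
function sketch_utf8v($string) {
  return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: extract unique trigrams from one token with substr(),
// tracking a byte cursor so multibyte characters are never cut in half.
function sketch_trigrams($token) {
  $chars = sketch_utf8v($token);
  $length = count($chars);

  // Record the byte length of every character once, up front.
  $byte_lengths = array();
  for ($ii = 0; $ii < $length; $ii++) {
    $byte_lengths[$ii] = strlen($chars[$ii]);
  }

  $ngrams = array();
  $cursor = 0;
  for ($ii = 0; $ii < $length - 2; $ii++) {
    // A trigram spans the bytes of three consecutive characters.
    $ngram_bytes =
      $byte_lengths[$ii] +
      $byte_lengths[$ii + 1] +
      $byte_lengths[$ii + 2];

    $ngram = substr($token, $cursor, $ngram_bytes);
    $ngrams[$ngram] = $ngram;

    // Advance by one character, measured in bytes.
    $cursor += $byte_lengths[$ii];
  }

  return array_values($ngrams);
}

$snowman = "\xE2\x98\x83";
// Prints " x☃|x☃y|☃y " (three trigrams, each containing the 3-byte snowman).
echo implode('|', sketch_trigrams(" x{$snowman}y ")), "\n";
```

Precomputing the per-character byte lengths means each iteration does a constant amount of arithmetic and a single `substr()` call, instead of building a three-element array and joining it back into a string.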

Test Plan:
- Created a task with a single, unbroken blob of base64 encoded data as the description, roughly 100KB long.
- Saw indexing performance improve from ~6s to ~1.5s after patch.
- Before: https://secure.phabricator.com/xhprof/profile/PHID-FILE-nrxs4lwdvupbve5lhl6u/
- After: https://secure.phabricator.com/xhprof/profile/PHID-FILE-6vs2akgjj5nbqt7yo7ul/

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T12974

Differential Revision: https://secure.phabricator.com/D18649

+48 -4
+18 -4
src/applications/search/ferret/PhabricatorFerretEngine.php
···
     $ngrams = array();
     foreach ($unique_tokens as $token => $ignored) {
       $token_v = phutil_utf8v($token);
-      $len = (count($token_v) - 2);
-      for ($ii = 0; $ii < $len; $ii++) {
-        $ngram = array_slice($token_v, $ii, 3);
-        $ngram = implode('', $ngram);
+      $length = count($token_v);
+
+      // NOTE: We're being somewhat clever here to micro-optimize performance,
+      // especially for very long strings. See PHI87.
+
+      $token_l = array();
+      for ($ii = 0; $ii < $length; $ii++) {
+        $token_l[$ii] = strlen($token_v[$ii]);
+      }
+
+      $ngram_count = $length - 2;
+      $cursor = 0;
+      for ($ii = 0; $ii < $ngram_count; $ii++) {
+        $ngram_l = $token_l[$ii] + $token_l[$ii + 1] + $token_l[$ii + 2];
+
+        $ngram = substr($token, $cursor, $ngram_l);
         $ngrams[$ngram] = $ngram;
+
+        $cursor += $token_l[$ii];
       }
     }

+30
src/applications/search/ferret/__tests__/PhabricatorFerretEngineTestCase.php
···
     }
   }

+  public function testTermNgramExtraction() {
+    $snowman = "\xE2\x98\x83";
+
+    $map = array(
+      'a' => array(' a '),
+      'ab' => array(' ab', 'ab '),
+      'abcdef' => array(' ab', 'abc', 'bcd', 'cde', 'def', 'ef '),
+      "{$snowman}" => array(" {$snowman} "),
+      "x{$snowman}y" => array(
+        " x{$snowman}",
+        "x{$snowman}y",
+        "{$snowman}y ",
+      ),
+      "{$snowman}{$snowman}" => array(
+        " {$snowman}{$snowman}",
+        "{$snowman}{$snowman} ",
+      ),
+    );
+
+    $engine = new ManiphestTaskFerretEngine();
+
+    foreach ($map as $input => $expect) {
+      $actual = $engine->getTermNgramsFromString($input);
+      $this->assertEqual(
+        $actual,
+        $expect,
+        pht('Term ngrams for: %s.', $input));
+    }
+  }
+
 }