fix: WiFi lifecycle and hyphenation heap defragmentation for KOReader sync (#1151)

## Summary

* **What is the goal of this PR?** KOReader sync on a German-language
book would fail with an out-of-memory error when trying to open the
destination chapter after applying remote progress. The root cause was a
chain of two independent bugs that combined to exhaust the contiguous
heap needed by the EPUB inflate pipeline.
* **What changes are included?**
## Fix 1 — Hyphenation heap defragmentation (LiangHyphenation.cpp)
### What was happening
AugmentedWord, the internal struct used during Liang pattern matching,
held three std::vector<> members (bytes, charByteOffsets,
byteToCharIndex) plus a separate scores vector — a total of 4 heap
allocations per word during page layout. For a German-language section
with hundreds of words, thousands of small malloc/free cycles fragmented
the heap. Total free memory was adequate (~108 KB) but the largest
contiguous block shrank well below the 32 KB needed for the INFLATE ring
buffer used during EPUB decompression. The failure was invisible with
hyphenation disabled, where MaxAlloc stayed at ~77 KB; enabling German
hyphenation silently destroyed the contiguity the allocator needed.

### What changed
The three std::vector<> members of AugmentedWord and the scores vector
are replaced with fixed-size C arrays on the render-task stack:
```
uint8_t bytes[160] // was std::vector<uint8_t>
size_t charByteOffsets[70] // was std::vector<size_t>
int32_t byteToCharIndex[160] // was std::vector<int32_t>
uint8_t scores[70] // was std::vector<uint8_t> (local in liangBreakIndexes)
```
Sizing is based on the longest known German word (~63 codepoints × 2
UTF-8 bytes + 2 sentinel dots = 128 bytes); MAX_WORD_BYTES=160 and
MAX_WORD_CHARS=70 give comfortable headroom. The same analysis holds for
all seven supported languages (en, fr, de, es, it, ru, uk) — every
accepted letter encodes to at most 2 UTF-8 bytes after case-folding.
Words exceeding the limits are silently skipped (no hyphenation
applied), which is correct behaviour. The struct lives on the 8 KB
render-task stack so no permanent DRAM is consumed.

Verification: after the fix, MaxAlloc reads 77,812 bytes with German
hyphenation enabled — identical to the figure previously achievable only
with hyphenation off.

## Fix 2 — WiFi lifecycle in KOReaderSyncActivity
(KOReaderSyncActivity.cpp)
### What was happening

onEnter() called WiFi.mode(WIFI_STA) unconditionally before delegating
to WifiSelectionActivity. WifiSelectionActivity manages WiFi mode
internally (it calls WiFi.mode(WIFI_STA) again at scan start and at
connection attempt). The pre-emptive call from KOReaderSyncActivity
interfered with the sub-activity's own state machine, causing
intermittent connection failures that were difficult to reproduce.

Additionally, WiFi was only shut down in onExit(). If the user chose
"Apply remote progress" the activity exited without turning WiFi off
first, leaving the radio on and its memory allocated while the EPUB was
being decompressed — unnecessarily consuming the contiguous heap
headroom that inflate needed.

### What changed

* WiFi.mode(WIFI_STA) removed from onEnter(). WifiSelectionActivity owns
WiFi mode; KOReaderSyncActivity should not touch it before the
sub-activity runs.
* A wifiOff() helper (SNTP stop + disconnect + WIFI_OFF with settling
delays) is extracted into the anonymous namespace and called at every
web-session exit point:
- "Apply remote" path in loop() — before onSyncComplete()
- performUpload() success path
- performUpload() failure path
- onExit() (safety net for all other exit paths)

## Additional Context

* Add any other information that might be helpful for the reviewer
(e.g., performance implications, potential risks,
specific areas to focus on).

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_ and two days of
blood, sweat and heavy swearing...

authored by

jpirnay and committed by

GitHub 4 months ago b695a48a a6c5d9aa

+133 -67

2 changed files

expand all

lib

Epub

hyphenation

LiangHyphenation.cpp

src

activities

reader

KOReaderSyncActivity.cpp

+118 -56

lib/Epub/Epub/hyphenation/LiangHyphenation.cpp

··· 49 49 * trie. All lookups stay within the generated blob, which lives in flash, and 50 50 * the working buffers (augmented bytes/scores) scale with the word length rather 51 51 * than the pattern corpus. 52 + * 53 + * Memory design note (heap fragmentation avoidance) 54 + * -------------------------------------------------- 55 + * AugmentedWord previously held three std::vector<> members that were heap- 56 + * allocated and freed for every word during layout. For a German-language section 57 + * with hundreds of words, these thousands of small alloc/free cycles fragment 58 + * the heap enough to prevent large contiguous allocations (e.g. a 32 KB inflate 59 + * ring buffer) even when total free memory is sufficient. 60 + * 61 + * The fix replaces those vectors with fixed-size C arrays sized for the longest 62 + * plausible word. The longest known German word is ~63 codepoints; with up to 63 + * 2 UTF-8 bytes per German letter + 2 sentinel dots = 128 bytes. MAX_WORD_BYTES=160 64 + * and MAX_WORD_CHARS=70 give comfortable headroom. Words exceeding these limits 65 + * are silently skipped (no hyphenation), which is acceptable for correctness. 66 + * The struct lives on the render-task stack (8 KB) so no permanent DRAM is wasted. 52 67 */ 53 68 54 69 namespace { 55 70 56 71 using EmbeddedAutomaton = SerializedHyphenationPatterns; 57 72 73 + // Upper bounds for the fixed word buffers. Sized for German (longest known word 74 + // ≈63 codepoints × 2 UTF-8 bytes + 2 sentinel dots = 128 bytes). Words that 75 + // exceed these limits are skipped rather than heap-allocated. 76 + static constexpr size_t MAX_WORD_BYTES = 160; // max UTF-8 bytes in augmented word 77 + static constexpr size_t MAX_WORD_CHARS = 70; // max codepoints + 2 sentinel dots 78 + 58 79 struct AugmentedWord { 59 - std::vector<uint8_t> bytes; 60 - std::vector<size_t> charByteOffsets; 61 - std::vector<int32_t> byteToCharIndex; 80 + uint8_t bytes[MAX_WORD_BYTES]; 81 + size_t charByteOffsets[MAX_WORD_CHARS]; 82 + int32_t byteToCharIndex[MAX_WORD_BYTES]; 83 + size_t byteLen = 0; 84 + size_t charCount_ = 0; 62 85 63 - bool empty() const { return bytes.empty(); } 64 - size_t charCount() const { return charByteOffsets.size(); } 86 + bool empty() const { return byteLen == 0; } 87 + size_t charCount() const { return charCount_; } 65 88 }; 66 89 67 - // Encode a single Unicode codepoint into UTF-8 and append to the provided buffer. 68 - size_t encodeUtf8(uint32_t cp, std::vector<uint8_t>& out) { 90 + // Encode a single Unicode codepoint into UTF-8 and append to word.bytes[]. 91 + // Returns the number of bytes written, or 0 if the codepoint is invalid or the 92 + // buffer would overflow. Surrogates (0xD800–0xDFFF) and values above 0x10FFFF 93 + // are not valid Unicode scalar values and are rejected. 94 + size_t encodeUtf8(uint32_t cp, AugmentedWord& word) { 95 + if ((cp >= 0xD800u && cp <= 0xDFFFu) || cp > 0x10FFFFu) { 96 + return 0; 97 + } 98 + 99 + uint8_t encoded[4]; 100 + size_t len = 0; 101 + 69 102 if (cp <= 0x7Fu) { 70 - out.push_back(static_cast<uint8_t>(cp)); 71 - return 1; 103 + encoded[len++] = static_cast<uint8_t>(cp); 104 + } else if (cp <= 0x7FFu) { 105 + encoded[len++] = static_cast<uint8_t>(0xC0u | ((cp >> 6) & 0x1Fu)); 106 + encoded[len++] = static_cast<uint8_t>(0x80u | (cp & 0x3Fu)); 107 + } else if (cp <= 0xFFFFu) { 108 + encoded[len++] = static_cast<uint8_t>(0xE0u | ((cp >> 12) & 0x0Fu)); 109 + encoded[len++] = static_cast<uint8_t>(0x80u | ((cp >> 6) & 0x3Fu)); 110 + encoded[len++] = static_cast<uint8_t>(0x80u | (cp & 0x3Fu)); 111 + } else { 112 + encoded[len++] = static_cast<uint8_t>(0xF0u | ((cp >> 18) & 0x07u)); 113 + encoded[len++] = static_cast<uint8_t>(0x80u | ((cp >> 12) & 0x3Fu)); 114 + encoded[len++] = static_cast<uint8_t>(0x80u | ((cp >> 6) & 0x3Fu)); 115 + encoded[len++] = static_cast<uint8_t>(0x80u | (cp & 0x3Fu)); 72 116 } 73 - if (cp <= 0x7FFu) { 74 - out.push_back(static_cast<uint8_t>(0xC0u | ((cp >> 6) & 0x1Fu))); 75 - out.push_back(static_cast<uint8_t>(0x80u | (cp & 0x3Fu))); 76 - return 2; 117 + 118 + if (word.byteLen + len > MAX_WORD_BYTES) { 119 + return 0; // overflow: word too long for fixed buffer, skip hyphenation 77 120 } 78 - if (cp <= 0xFFFFu) { 79 - out.push_back(static_cast<uint8_t>(0xE0u | ((cp >> 12) & 0x0Fu))); 80 - out.push_back(static_cast<uint8_t>(0x80u | ((cp >> 6) & 0x3Fu))); 81 - out.push_back(static_cast<uint8_t>(0x80u | (cp & 0x3Fu))); 82 - return 3; 121 + for (size_t i = 0; i < len; ++i) { 122 + word.bytes[word.byteLen++] = encoded[i]; 83 123 } 84 - out.push_back(static_cast<uint8_t>(0xF0u | ((cp >> 18) & 0x07u))); 85 - out.push_back(static_cast<uint8_t>(0x80u | ((cp >> 12) & 0x3Fu))); 86 - out.push_back(static_cast<uint8_t>(0x80u | ((cp >> 6) & 0x3Fu))); 87 - out.push_back(static_cast<uint8_t>(0x80u | (cp & 0x3Fu))); 88 - return 4; 124 + return len; 89 125 } 90 126 91 - // Build the dotted, lowercase UTF-8 representation plus lookup tables. 92 - AugmentedWord buildAugmentedWord(const std::vector<CodepointInfo>& cps, const LiangWordConfig& config) { 93 - AugmentedWord word; 127 + // Build the dotted, lowercase UTF-8 representation plus lookup tables into `word`. 128 + // Returns false if the word should be skipped (empty, non-letter, or too long). 129 + bool buildAugmentedWord(AugmentedWord& word, const std::vector<CodepointInfo>& cps, const LiangWordConfig& config) { 130 + word.byteLen = 0; 131 + word.charCount_ = 0; 132 + 94 133 if (cps.empty()) { 95 - return word; 134 + return false; 96 135 } 97 136 98 - word.bytes.reserve(cps.size() * 2 + 2); 99 - word.charByteOffsets.reserve(cps.size() + 2); 100 - 101 - word.charByteOffsets.push_back(0); 102 - word.bytes.push_back('.'); 137 + // Leading sentinel '.' 138 + word.charByteOffsets[word.charCount_++] = 0; 139 + word.bytes[word.byteLen++] = '.'; 103 140 104 141 for (const auto& info : cps) { 105 142 if (!config.isLetter(info.value)) { 106 - word.bytes.clear(); 107 - word.charByteOffsets.clear(); 108 - word.byteToCharIndex.clear(); 109 - return word; 143 + word.byteLen = 0; 144 + word.charCount_ = 0; 145 + return false; 110 146 } 111 - word.charByteOffsets.push_back(word.bytes.size()); 112 - encodeUtf8(config.toLower(info.value), word.bytes); 147 + // Reserve one slot for the trailing sentinel and check byte headroom. 148 + if (word.charCount_ >= MAX_WORD_CHARS - 1) { 149 + word.byteLen = 0; 150 + word.charCount_ = 0; 151 + return false; // word too long 152 + } 153 + word.charByteOffsets[word.charCount_++] = word.byteLen; 154 + if (encodeUtf8(config.toLower(info.value), word) == 0) { 155 + word.byteLen = 0; 156 + word.charCount_ = 0; 157 + return false; // byte buffer overflow 158 + } 113 159 } 114 160 115 - word.charByteOffsets.push_back(word.bytes.size()); 116 - word.bytes.push_back('.'); 161 + // Trailing sentinel '.' 162 + if (word.charCount_ >= MAX_WORD_CHARS || word.byteLen >= MAX_WORD_BYTES) { 163 + word.byteLen = 0; 164 + word.charCount_ = 0; 165 + return false; 166 + } 167 + word.charByteOffsets[word.charCount_++] = word.byteLen; 168 + word.bytes[word.byteLen++] = '.'; 117 169 118 - word.byteToCharIndex.assign(word.bytes.size(), -1); 119 - for (size_t i = 0; i < word.charByteOffsets.size(); ++i) { 170 + // Build byte→char reverse index: -1 for mid-codepoint bytes, char index for start bytes. 171 + for (size_t i = 0; i < word.byteLen; ++i) { 172 + word.byteToCharIndex[i] = -1; 173 + } 174 + for (size_t i = 0; i < word.charCount_; ++i) { 120 175 const size_t offset = word.charByteOffsets[i]; 121 - if (offset < word.byteToCharIndex.size()) { 176 + if (offset < word.byteLen) { 122 177 word.byteToCharIndex[offset] = static_cast<int32_t>(i); 123 178 } 124 179 } 125 - return word; 180 + 181 + return true; 126 182 } 127 183 128 184 // Decoded view of a single trie node pulled straight out of the serialized blob. ··· 256 312 // Converts odd score positions back into codepoint indexes, honoring min prefix/suffix constraints. 257 313 // Each break corresponds to scores[breakIndex + 1] because of the leading '.' sentinel. 258 314 // Convert odd score entries into hyphen positions while honoring prefix/suffix limits. 259 - std::vector<size_t> collectBreakIndexes(const std::vector<CodepointInfo>& cps, const std::vector<uint8_t>& scores, 260 - const size_t minPrefix, const size_t minSuffix) { 315 + std::vector<size_t> collectBreakIndexes(const std::vector<CodepointInfo>& cps, const uint8_t* scores, 316 + const size_t scoresSize, const size_t minPrefix, const size_t minSuffix) { 261 317 std::vector<size_t> indexes; 262 318 const size_t cpCount = cps.size(); 263 319 if (cpCount < 2) { ··· 275 331 } 276 332 277 333 const size_t scoreIdx = breakIndex + 1; 278 - if (scoreIdx >= scores.size()) { 334 + if (scoreIdx >= scoresSize) { 279 335 break; 280 336 } 281 337 if ((scores[scoreIdx] & 1u) == 0) { ··· 292 348 // Entry point that runs the full Liang pipeline for a single word. 293 349 std::vector<size_t> liangBreakIndexes(const std::vector<CodepointInfo>& cps, 294 350 const SerializedHyphenationPatterns& patterns, const LiangWordConfig& config) { 295 - const auto augmented = buildAugmentedWord(cps, config); 296 - if (augmented.empty()) { 351 + // AugmentedWord uses fixed-size C arrays (no heap allocation) to avoid 352 + // fragmenting the heap across hundreds of words during page layout. 353 + AugmentedWord augmented; 354 + if (!buildAugmentedWord(augmented, cps, config)) { 297 355 return {}; 298 356 } 299 357 ··· 305 363 } 306 364 307 365 // Liang scores: one entry per augmented char (leading/trailing dots included). 308 - std::vector<uint8_t> scores(augmented.charCount(), 0); 366 + // Stack-allocated to avoid heap fragmentation (see memory design note above). 367 + uint8_t scores[MAX_WORD_CHARS]; 368 + for (size_t i = 0; i < augmented.charCount_; ++i) { 369 + scores[i] = 0; 370 + } 309 371 310 372 // Walk every starting character position and stream bytes through the trie. 311 - for (size_t charStart = 0; charStart < augmented.charByteOffsets.size(); ++charStart) { 373 + for (size_t charStart = 0; charStart < augmented.charCount_; ++charStart) { 312 374 const size_t byteStart = augmented.charByteOffsets[charStart]; 313 375 AutomatonState state = root; 314 376 315 - for (size_t cursor = byteStart; cursor < augmented.bytes.size(); ++cursor) { 377 + for (size_t cursor = byteStart; cursor < augmented.byteLen; ++cursor) { 316 378 AutomatonState next; 317 379 if (!transition(automaton, state, augmented.bytes[cursor], next)) { 318 380 break; // No more matches for this prefix. ··· 329 391 330 392 offset += dist; 331 393 const size_t splitByte = byteStart + offset; 332 - if (splitByte >= augmented.byteToCharIndex.size()) { 394 + if (splitByte >= augmented.byteLen) { 333 395 continue; 334 396 } 335 397 ··· 337 399 if (boundary < 0) { 338 400 continue; // Mid-codepoint byte, wait for the next one. 339 401 } 340 - if (boundary < 2 || boundary + 2 > static_cast<int32_t>(augmented.charCount())) { 402 + if (boundary < 2 || boundary + 2 > static_cast<int32_t>(augmented.charCount_)) { 341 403 continue; // Skip splits that land in the leading/trailing sentinels. 342 404 } 343 405 344 406 const size_t idx = static_cast<size_t>(boundary); 345 - if (idx >= scores.size()) { 407 + if (idx >= augmented.charCount_) { 346 408 continue; 347 409 } 348 410 scores[idx] = std::max(scores[idx], level); ··· 351 413 } 352 414 } 353 415 354 - return collectBreakIndexes(cps, scores, config.minPrefix, config.minSuffix); 416 + return collectBreakIndexes(cps, scores, augmented.charCount_, config.minPrefix, config.minSuffix); 355 417 }

+15 -11

src/activities/reader/KOReaderSyncActivity.cpp

··· 39 39 LOG_DBG("KOSync", "NTP sync timeout, using fallback"); 40 40 } 41 41 } 42 + void wifiOff() { 43 + if (esp_sntp_enabled()) { 44 + esp_sntp_stop(); 45 + } 46 + WiFi.disconnect(false); 47 + delay(100); 48 + WiFi.mode(WIFI_OFF); 49 + delay(100); 50 + } 42 51 } // namespace 43 52 44 53 void KOReaderSyncActivity::onWifiSelectionComplete(const bool success) { ··· 164 173 const auto result = KOReaderSyncClient::updateProgress(progress); 165 174 166 175 if (result != KOReaderSyncClient::OK) { 176 + wifiOff(); 167 177 { 168 178 RenderLock lock(*this); 169 179 state = SYNC_FAILED; ··· 173 183 return; 174 184 } 175 185 186 + wifiOff(); 176 187 { 177 188 RenderLock lock(*this); 178 189 state = UPLOAD_COMPLETE; ··· 190 201 return; 191 202 } 192 203 193 - // Turn on WiFi 194 - LOG_DBG("KOSync", "Turning on WiFi..."); 195 - WiFi.mode(WIFI_STA); 196 - 197 - // Check if already connected 204 + // Check if already connected (e.g. from settings page auth) 198 205 if (WiFi.status() == WL_CONNECTED) { 199 206 LOG_DBG("KOSync", "Already connected to WiFi"); 200 207 state = SYNCING; ··· 228 235 void KOReaderSyncActivity::onExit() { 229 236 ActivityWithSubactivity::onExit(); 230 237 231 - // Turn off wifi 232 - WiFi.disconnect(false); 233 - delay(100); 234 - WiFi.mode(WIFI_OFF); 235 - delay(100); 238 + wifiOff(); 236 239 } 237 240 238 241 void KOReaderSyncActivity::render(Activity::RenderLock&&) { ··· 380 383 381 384 if (mappedInput.wasPressed(MappedInputManager::Button::Confirm)) { 382 385 if (selectedOption == 0) { 383 - // Apply remote progress 386 + // Apply remote progress — WiFi no longer needed 387 + wifiOff(); 384 388 onSyncComplete(remotePosition.spineIndex, remotePosition.pageNumber); 385 389 } else if (selectedOption == 1) { 386 390 // Upload local progress

Configure Feed

Configure Feed