A fork of https://github.com/crosspoint-reader/crosspoint-reader

fix: Support hyphenation for EPUBs using ISO 639-2 language codes (#1461)

## Summary

EPUBs that use ISO 639-2 three-letter language codes in their
`dc:language` metadata (e.g. `<dc:language>eng</dc:language>`) got no
hyphenation. The hyphenator registry only matched ISO 639-1 two-letter
codes (`"en"`, `"fr"`, etc.), so `"eng"` produced a null hyphenator and
every word in the book was treated as unhyphenatable.

Added a normalization step in `hyphenatorForLanguage` that maps ISO
639-2 codes (both bibliographic and terminological variants) to their
two-letter equivalents before the registry lookup.
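
The normalization described above can be sketched in isolation as follows. This is a minimal, standalone illustration rather than the code in the PR: `normalizePrimaryTag` is a hypothetical helper name, and the rule that the primary subtag ends at the first `-` or `_` is an assumption, since that part of `hyphenatorForLanguage` is elided in the diff. The mapping table uses the same pairs the PR adds.

```cpp
#include <cctype>
#include <string>

// Same ISO 639-2 -> ISO 639-1 pairs as the table added in the PR.
struct Iso639Mapping {
  const char* iso639_2;
  const char* iso639_1;
};

constexpr Iso639Mapping kIso639Mappings[] = {
    {"eng", "en"}, {"fra", "fr"}, {"fre", "fr"}, {"deu", "de"}, {"ger", "de"},
    {"rus", "ru"}, {"spa", "es"}, {"ita", "it"}, {"ukr", "uk"},
};

// Hypothetical helper (not part of the PR): returns the lowercased primary
// subtag, mapped to its two-letter form when it is a known three-letter code.
// Unknown tags are returned lowercased but otherwise unchanged.
std::string normalizePrimaryTag(const std::string& langTag) {
  std::string primary;
  primary.reserve(langTag.size());
  for (char c : langTag) {
    // Assumption: the primary subtag ends at the first '-' or '_'.
    if (c == '-' || c == '_') break;
    primary.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
  }

  // Linear scan is fine here: the table has only a handful of entries.
  for (const auto& mapping : kIso639Mappings) {
    if (primary == mapping.iso639_2) {
      primary = mapping.iso639_1;
      break;
    }
  }
  return primary;
}
```

Under these assumptions, `normalizePrimaryTag("eng")`, `normalizePrimaryTag("ENG")`, and `normalizePrimaryTag("en-US")` all return `"en"`, while a tag with no table entry, such as `"pt-BR"`, comes back as `"pt"` and is left for the registry lookup to accept or reject.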

Discovered via *Project Hail Mary* (Random House), which uses
`<dc:language>eng</dc:language>`.

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**PARTIALLY**_

Authored by Zach Nelson and committed by GitHub

1c133311 9b388513

+24 -3
+1 -1
lib/Epub/Epub/Section.cpp
    ···
     #include "parsers/ChapterHtmlSlimParser.h"

     namespace {
    -constexpr uint8_t SECTION_FILE_VERSION = 18;
    +constexpr uint8_t SECTION_FILE_VERSION = 19;
     constexpr uint32_t HEADER_SIZE = sizeof(uint8_t) + sizeof(int) + sizeof(float) + sizeof(bool) + sizeof(uint8_t) +
                                      sizeof(uint16_t) + sizeof(uint16_t) + sizeof(uint16_t) + sizeof(bool) + sizeof(bool) +
                                      sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint32_t);
+23 -2
lib/Epub/Epub/hyphenation/Hyphenator.cpp
    ···

     namespace {

    -// Maps a BCP-47 language tag to a language-specific hyphenator.
    +// Normalize ISO 639-2 (three-letter) codes to ISO 639-1 (two-letter) codes used by the
    +// hyphenation registry. EPUBs may use either form in their dc:language metadata (e.g.
    +// "eng" instead of "en"). Both the bibliographic ("fre"/"ger") and terminological
    +// ("fra"/"deu") ISO 639-2 variants are mapped.
    +struct Iso639Mapping {
    +  const char* iso639_2;
    +  const char* iso639_1;
    +};
    +static constexpr Iso639Mapping kIso639Mappings[] = {
    +    {"eng", "en"}, {"fra", "fr"}, {"fre", "fr"}, {"deu", "de"}, {"ger", "de"},
    +    {"rus", "ru"}, {"spa", "es"}, {"ita", "it"}, {"ukr", "uk"},
    +};
    +
    +// Maps a BCP-47 or ISO 639-2 language tag to a language-specific hyphenator.
     const LanguageHyphenator* hyphenatorForLanguage(const std::string& langTag) {
       if (langTag.empty()) return nullptr;

    -  // Extract primary subtag and normalize to lowercase (e.g., "en-US" -> "en").
    +  // Extract primary subtag and normalize to lowercase (e.g., "en-US" -> "en", "ENG" -> "en").
       std::string primary;
       primary.reserve(langTag.size());
       for (char c : langTag) {
    ···
         primary.push_back(c);
       }
       if (primary.empty()) return nullptr;
    +
    +  // Normalize ISO 639-2 three-letter codes to two-letter equivalents.
    +  for (const auto& mapping : kIso639Mappings) {
    +    if (primary == mapping.iso639_2) {
    +      primary = mapping.iso639_1;
    +      break;
    +    }
    +  }

       return getLanguageHyphenatorForPrimaryTag(primary);
     }