Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

get_maintainer: correctly parse UTF-8 encoded names in files

While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{L}, which matches on
any letter according to the Unicode General Category of letters.

The proposed solution only works if the script uses proper string
encoding from the outset, so instruct Perl to unconditionally open all
files with UTF-8 encoding. This should be safe, as the entire source
tree is either UTF-8 or ASCII encoded anyway. See [1] for a detailed
analysis.

Furthermore, to prevent the \w expression from matching non-ASCII when
checking for whether a name should be escaped with quotes, add the /a
flag to the regular expression. The escaping logic was duplicated in
two places, so it has been factored out into its own function.

The original issue was also identified on the tools mailing list [2].
This should solve the observed side effects there as well.

Link: https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/ [1]
Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [2]
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Alvin Šipraga and committed by Linus Torvalds
9c334eb9 453f5db0

+17 -13
scripts/get_maintainer.pl
···
 use Cwd;
 use File::Find;
 use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));
 
 my $cur_path = fastgetcwd() . '/';
 my $lk_path = "./";
···
         my $text = do { local($/) ; <$f> };
         close($f);
 
-        my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+        my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
         push(@file_emails, clean_file_emails(@poss_addr));
     }
 }
···
     return 0;
 }
 
+sub escape_name {
+    my ($name) = @_;
+
+    if ($name =~ /[^\w \-]/ai) { ##has "must quote" chars
+        $name =~ s/(?<!\\)"/\\"/g; ##escape quotes
+        $name = "\"$name\"";
+    }
+
+    return $name;
+}
+
 sub parse_email {
     my ($formatted_email) = @_;
 
···
 
     $name =~ s/^\s+|\s+$//g;
     $name =~ s/^\"|\"$//g;
+    $name = escape_name($name);
     $address =~ s/^\s+|\s+$//g;
-
-    if ($name =~ /[^\w \-]/i) { ##has "must quote" chars
-        $name =~ s/(?<!\\)"/\\"/g; ##escape quotes
-        $name = "\"$name\"";
-    }
 
     return ($name, $address);
 }
···
 
     $name =~ s/^\s+|\s+$//g;
     $name =~ s/^\"|\"$//g;
+    $name = escape_name($name);
     $address =~ s/^\s+|\s+$//g;
-
-    if ($name =~ /[^\w \-]/i) { ##has "must quote" chars
-        $name =~ s/(?<!\\)"/\\"/g; ##escape quotes
-        $name = "\"$name\"";
-    }
 
     if ($usename) {
         if ("$name" eq "") {
···
             $name = "";
         }
 
-        my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+        my @nw = split(/[^\p{L}\'\,\.\+-]/, $name);
         if (@nw > 2) {
             my $first = $nw[@nw - 3];
             my $middle = $nw[@nw - 2];
             my $last = $nw[@nw - 1];
 
-            if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+            if (((length($first) == 1 && $first =~ m/\p{L}/) ||
                  (length($first) == 2 && substr($first, -1) eq ".")) ||
                 (length($middle) == 1 ||
                  (length($middle) == 2 && substr($middle, -1) eq "."))) {
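
The effect of the two regular-expression changes can be reproduced outside the script. The following is a minimal sketch, not code from get_maintainer.pl: the helper names (trailing_name_latin1, trailing_name_unicode, must_quote) are hypothetical, and the patterns are reduced versions of the ones in the diff. It assumes the input is already a decoded character string, which is what the added `use open` pragma provides for file reads.

```perl
use strict;
use warnings;
use utf8;                            # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));  # same pragma the patch adds

# Hypothetical helpers for illustration only; not part of get_maintainer.pl.

# Old character class: À-ÿ ends at U+00FF, so "Š" (U+0160) is not matched
# and the name is cut at that character.
sub trailing_name_latin1 {
    my ($text) = @_;
    my ($match) = $text =~ /([A-Za-zÀ-ÿ ]+)$/;
    return $match;
}

# New character class: \p{L} matches any Unicode letter.
sub trailing_name_unicode {
    my ($text) = @_;
    my ($match) = $text =~ /([\p{L} ]+)$/;
    return $match;
}

# Same condition as in escape_name(): with /a, \w is ASCII-only, so a
# non-ASCII letter still counts as a "must quote" character.
sub must_quote {
    my ($name) = @_;
    return $name =~ /[^\w \-]/a ? 1 : 0;
}

my $name = "Alvin Šipraga";
print "latin1:  ", trailing_name_latin1($name), "\n";   # prints "ipraga"
print "unicode: ", trailing_name_unicode($name), "\n";  # prints "Alvin Šipraga"
print "quote:   ", must_quote($name), "\n";             # prints 1
```

Without the /a flag, \w would match "Š" in a decoded string, so must_quote would return 0 for this name; that is the regression the commit message says the flag prevents.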