From nobody Fri Dec 19 11:35:58 2025
Received: from out-171.mta0.migadu.com (out-171.mta0.migadu.com [91.218.175.171])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9706E15B1
 for ; Tue, 19 Dec 2023 01:25:39 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=pqrs.dk
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=pqrs.dk
Authentication-Results: smtp.subspace.kernel.org;
 dkim=pass (2048-bit key) header.d=pqrs.dk header.i=@pqrs.dk header.b="lfnekWpA"
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pqrs.dk; s=key1;
 t=1702949137;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references;
 bh=PDBV8cJaB7iV+TTP4pZBNgMOM9Qt20Xbuc5uqu6VF04=;
 b=lfnekWpAB6CzsHQTX6yE3W9MFHdgtKLUx+4hEtPY/9stmu7C/2VGc05jgGkTH1x/jOpqs8
 ng9+j8baGfujv1RHW0M44wwhjI0cz9SjOZq0T1H08gqzHJmLylyuYPJFKAyqpzVasBifih
 OgTGD5tseC6RYuU6emWw1lkSbBiYxApssqU59LuNkvdVt1lIeKk9QiOWx5Xy0ky/jHJ6J0
 NHIaPUv67bZZHYsUbbZX7OkRDceSVRkfGKpeplV5sMLL5uzE0uCUccQzPXaVZ3qwOLD1jV
 ITxlvoapBY1LwCsayvq8vj/UKuHlIIJesyN2jZiyzO6N8Xu77j/2DOdnVT5Bvw==
From: Alvin Šipraga
Date: Tue, 19 Dec 2023 02:25:14 +0100
Subject: [PATCH v3 1/2] get_maintainer: correctly parse UTF-8 encoded names in files
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20231219-get-maintainers-utf8-v3-1-f85a39e2265a@bang-olufsen.dk>
References: <20231219-get-maintainers-utf8-v3-0-f85a39e2265a@bang-olufsen.dk>
In-Reply-To:
 <20231219-get-maintainers-utf8-v3-0-f85a39e2265a@bang-olufsen.dk>
To: Joe Perches, Linus Torvalds, Andrew Morton
Cc: Duje Mihanović, Konstantin Ryabitsev,
 linux-kernel@vger.kernel.org, Alvin Šipraga
X-Migadu-Flow: FLOW_OUT

From: Alvin Šipraga

While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{L}, which matches on
any letter according to the Unicode General Category of letters.

The proposed solution only works if the script uses proper string
encoding from the outset, so instruct Perl to unconditionally open all
files with UTF-8 encoding. This should be safe, as the entire source
tree is either UTF-8 or ASCII encoded anyway. See [1] for a detailed
analysis.

Furthermore, to prevent the \w expression from matching non-ASCII when
checking for whether a name should be escaped with quotes, add the /a
flag to the regular expression. The escaping logic was duplicated in two
places, so it has been factored out into its own function.

The original issue was also identified on the tools mailing list [2].
This should solve the observed side effects there as well.

Link: https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/ [1]
Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [2]
Signed-off-by: Alvin Šipraga
---
 scripts/get_maintainer.pl | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index 16d8ac6005b6..dac38c6e3b1c 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
 use Cwd;
 use File::Find;
 use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));
 
 my $cur_path = fastgetcwd() . '/';
 my $lk_path = "./";
@@ -445,7 +446,7 @@ sub maintainers_in_file {
 	my $text = do { local($/) ; <$f> };
 	close($f);
 
-	my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+	my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
 	push(@file_emails, clean_file_emails(@poss_addr));
     }
 }
@@ -1152,6 +1153,17 @@ sub top_of_kernel_tree {
     return 0;
 }
 
+sub escape_name {
+    my ($name) = @_;
+
+    if ($name =~ /[^\w \-]/ai) {  ##has "must quote" chars
+	$name =~ s/(? 2) {
 		my $first = $nw[@nw - 3];
 		my $middle = $nw[@nw - 2];
 		my $last = $nw[@nw - 1];
 
-		if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+		if (((length($first) == 1 && $first =~ m/\p{L}/) ||
 		     (length($first) == 2 && substr($first, -1) eq ".")) ||
 		    (length($middle) == 1 ||
 		     (length($middle) == 2 && substr($middle, -1) eq "."))) {
-- 
2.43.0