[PATCH] checkpatch: use utf-8 match for spell checking

Antonio Borneo posted 1 patch 2 years, 1 month ago
[PATCH] checkpatch: use utf-8 match for spell checking
Posted by Antonio Borneo 2 years, 1 month ago
The spell-checking code tests, with a somewhat more complex regex,
whether $rawline matches [^\w]($misspellings)[^\w]

Since $rawline is a byte string, a single byte of a multi-byte UTF-8
character in $rawline can match the non-word class [^\w].
E.g.:
	./script/checkpatch.pl --git 81c2f059ab9
	WARNING: 'ment' may be misspelled - perhaps 'meant'?
	#36: FILE: MAINTAINERS:14360:
	+M:     Clément Léger <clement.leger@bootlin.com>
	            ^^^^

Use a utf-8 version of $rawline for spell checking.

Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
Reported-by: Clément Le Goffic <clement.legoffic@foss.st.com>
---
 scripts/checkpatch.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 25fdb7fda112..58646bd6ef56 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3477,7 +3477,8 @@ sub process {
 # Check for various typo / spelling mistakes
 		if (defined($misspellings) &&
 		    ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
-			while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
+			my $rawline_utf8 = decode("utf8", $rawline);
+			while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
 				my $typo = $1;
 				my $blank = copy_spacing($rawline);
 				my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);

base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
-- 
2.42.0
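
The false positive described in the commit message can be reproduced
outside checkpatch with a minimal sketch. It assumes a one-word stand-in
for $misspellings (the real alternation is built from scripts/spelling.txt)
and writes the reported MAINTAINERS line as raw UTF-8 bytes. The names
$byte_hit and $char_hit are illustrative only and do not exist in
checkpatch.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical one-entry stand-in for the alternation that checkpatch
# builds from scripts/spelling.txt.
my $misspellings = 'ment';

# The reported line as raw UTF-8 bytes: "\xC3\xA9" encodes U+00E9 (e-acute).
my $rawline = "+M:     Cl\xC3\xA9ment L\xC3\xA9ger <clement.leger\@bootlin.com>";

# Same pattern as in the patch context.
my $re = qr/(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/i;

# Byte string: the byte 0xA9 is not a word character, so "ment" looks
# like a separate word.
my $byte_hit = ($rawline =~ $re) ? 1 : 0;

# Decoded string: U+00E9 is a word character, so "ment" stays part of
# the name and the pattern does not match.
my $rawline_utf8 = decode("utf8", $rawline);
my $char_hit = ($rawline_utf8 =~ $re) ? 1 : 0;

print "byte string match:    $byte_hit\n";
print "decoded string match: $char_hit\n";
```

Only the decoded string treats the accented letter as part of the word,
which is the behaviour the patch relies on.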

Re: [PATCH] checkpatch: use utf-8 match for spell checking
Posted by Joe Perches 2 years, 1 month ago
On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> The spell-checking code tests, with a somewhat more complex regex,
> whether $rawline matches [^\w]($misspellings)[^\w]
> 
> Since $rawline is a byte string, a single byte of a multi-byte UTF-8
> character in $rawline can match the non-word class [^\w].
> E.g.:
> 	./script/checkpatch.pl --git 81c2f059ab9
> 	WARNING: 'ment' may be misspelled - perhaps 'meant'?
> 	#36: FILE: MAINTAINERS:14360:
> 	+M:     Clément Léger <clement.leger@bootlin.com>
> 	            ^^^^
> 
> Use a utf-8 version of $rawline for spell checking.
> 
> Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
> Reported-by: Clément Le Goffic <clement.legoffic@foss.st.com>

Seems sensible, thanks, but:

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3477,7 +3477,8 @@ sub process {
>  # Check for various typo / spelling mistakes
>  		if (defined($misspellings) &&
>  		    ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> -			while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> +			my $rawline_utf8 = decode("utf8", $rawline);
> +			while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
>  				my $typo = $1;
>  				my $blank = copy_spacing($rawline);

Maybe this needs to use $rawline_utf8?

>  				my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);

And maybe now the $fix bit will not always work properly.
Re: [PATCH] checkpatch: use utf-8 match for spell checking
Posted by Antonio Borneo 2 years, 1 month ago
On Tue, 2023-12-12 at 11:07 -0800, Joe Perches wrote:
> On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> > The spell-checking code tests, with a somewhat more complex regex,
> > whether $rawline matches [^\w]($misspellings)[^\w]
> > 
> > Since $rawline is a byte string, a single byte of a multi-byte UTF-8
> > character in $rawline can match the non-word class [^\w].
> > E.g.:
> >         ./script/checkpatch.pl --git 81c2f059ab9
> >         WARNING: 'ment' may be misspelled - perhaps 'meant'?
> >         #36: FILE: MAINTAINERS:14360:
> >         +M:     Clément Léger <clement.leger@bootlin.com>
> >                     ^^^^
> > 
> > Use a utf-8 version of $rawline for spell checking.
> > 
> > Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
> > Reported-by: Clément Le Goffic <clement.legoffic@foss.st.com>
> 
> Seems sensible, thanks, but:
> 
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
> > @@ -3477,7 +3477,8 @@ sub process {
> >  # Check for various typo / spelling mistakes
> >                 if (defined($misspellings) &&
> >                     ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> > -                       while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> > +                       my $rawline_utf8 = decode("utf8", $rawline);
> > +                       while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> >                                 my $typo = $1;
> >                                 my $blank = copy_spacing($rawline);
> 
> Maybe this needs to use $rawline_utf8?

Correct, I will send a v2!

> 
> >                                 my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
> 
> And maybe now the $fix bit will not always work properly.

I have run some tests and it looks OK with the current ASCII-only file scripts/spelling.txt.

I have also tested adding some UTF-8 strings to the spelling file, but checkpatch reads it as
ASCII, and extending it to UTF-8 would require further modifications in checkpatch, way beyond
this simple fix.

Thanks for the review.
Antonio

[PATCH v2] checkpatch: use utf-8 match for spell checking
Posted by Antonio Borneo 2 years, 1 month ago
The spell-checking code tests, with a somewhat more complex regex,
whether $rawline matches [^\w]($misspellings)[^\w]

Since $rawline is a byte string, a single byte of a multi-byte UTF-8
character in $rawline can match the non-word class [^\w].
E.g.:
	./scripts/checkpatch.pl --git 81c2f059ab9
	WARNING: 'ment' may be misspelled - perhaps 'meant'?
	#36: FILE: MAINTAINERS:14360:
	+M:     Clément Léger <clement.leger@bootlin.com>
	            ^^^^

Use a utf-8 version of $rawline for spell checking.

Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
Reported-by: Clément Le Goffic <clement.legoffic@foss.st.com>
---
Changes in v2:
- use $rawline_utf8 also in the while-loop's body;
- fix path of checkpatch in the commit message.
---
 scripts/checkpatch.pl | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 25fdb7fda112..2d122d232c6d 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3477,9 +3477,10 @@ sub process {
 # Check for various typo / spelling mistakes
 		if (defined($misspellings) &&
 		    ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
-			while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
+			my $rawline_utf8 = decode("utf8", $rawline);
+			while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
 				my $typo = $1;
-				my $blank = copy_spacing($rawline);
+				my $blank = copy_spacing($rawline_utf8);
 				my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
 				my $hereptr = "$hereline$ptr\n";
 				my $typo_fix = $spelling_fix{lc($typo)};

base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
-- 
2.42.0
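
The review comment about the loop body comes down to offsets: $-[1] counts
bytes when the match runs on the byte string and characters when it runs
on the decoded one. The sketch below illustrates this under stated
assumptions: 'teh' is a hypothetical single-entry word list, the input
line is made up, and blank_like() is a hypothetical stand-in that assumes
copy_spacing() keeps tabs and turns every other character into a space.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical one-entry word list and a made-up input line (raw UTF-8 bytes).
my $misspellings = 'teh';
my $rawline      = "+ Cl\xC3\xA9ment wrote teh code";
my $rawline_utf8 = decode("utf8", $rawline);

my $re = qr/(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/i;

# $-[1] is the start offset of capture group 1 in the string just matched.
my ($byte_off, $char_off);
$byte_off = $-[1] if $rawline      =~ $re;   # offset in bytes
$char_off = $-[1] if $rawline_utf8 =~ $re;   # offset in characters

# Hypothetical stand-in for copy_spacing(): keep tabs, blank everything else.
sub blank_like {
	my ($s) = @_;
	(my $res = $s) =~ tr/\t/ /c;
	return $res;
}

# Build the caret line from the decoded string and the character offset,
# so that both refer to the same units.
my $blank = blank_like($rawline_utf8);
my $ptr   = substr($blank, 0, $char_off) . "^" x length($misspellings);

binmode(STDOUT, ":encoding(UTF-8)");
print "byte offset: $byte_off, character offset: $char_off\n";
print "$rawline_utf8\n$ptr\n";
```

The two offsets differ by one for each two-byte character that precedes
the typo, which is why the offset and the blanked string are taken from
the same decoded line here.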

Re: [PATCH v2] checkpatch: use utf-8 match for spell checking
Posted by Clement LE GOFFIC 1 year, 9 months ago
Hello,

A gentle reminder to review this patch.

Best regards,

Clément

On 1/2/24 17:10, Antonio Borneo wrote:
> The spell-checking code tests, with a somewhat more complex regex,
> whether $rawline matches [^\w]($misspellings)[^\w]
> 
> Since $rawline is a byte string, a single byte of a multi-byte UTF-8
> character in $rawline can match the non-word class [^\w].
> E.g.:
> 	./scripts/checkpatch.pl --git 81c2f059ab9
> 	WARNING: 'ment' may be misspelled - perhaps 'meant'?
> 	#36: FILE: MAINTAINERS:14360:
> 	+M:     Clément Léger <clement.leger@bootlin.com>
> 	            ^^^^
> 
> Use a utf-8 version of $rawline for spell checking.
> 
> Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
> Reported-by: Clément Le Goffic <clement.legoffic@foss.st.com>
> ---
> Changes in v2:
> - use $rawline_utf8 also in the while-loop's body;
> - fix path of checkpatch in the commit message.
> ---
>   scripts/checkpatch.pl | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 25fdb7fda112..2d122d232c6d 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3477,9 +3477,10 @@ sub process {
>   # Check for various typo / spelling mistakes
>   		if (defined($misspellings) &&
>   		    ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> -			while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> +			my $rawline_utf8 = decode("utf8", $rawline);
> +			while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
>   				my $typo = $1;
> -				my $blank = copy_spacing($rawline);
> +				my $blank = copy_spacing($rawline_utf8);
>   				my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
>   				my $hereptr = "$hereline$ptr\n";
>   				my $typo_fix = $spelling_fix{lc($typo)};
> 
> base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
Re: [PATCH v2] checkpatch: use utf-8 match for spell checking
Posted by Clement LE GOFFIC 8 months, 3 weeks ago
On 5/6/24 14:07, Clement LE GOFFIC wrote:
> Hello,
> 
> A gentle reminder to review this patch.
> 
> Best regards,
> 
> Clément
> 
> On 1/2/24 17:10, Antonio Borneo wrote:
>> The spell-checking code tests, with a somewhat more complex regex,
>> whether $rawline matches [^\w]($misspellings)[^\w]
>>
>> Since $rawline is a byte string, a single byte of a multi-byte UTF-8
>> character in $rawline can match the non-word class [^\w].
>> E.g.:
>>     ./scripts/checkpatch.pl --git 81c2f059ab9
>>     WARNING: 'ment' may be misspelled - perhaps 'meant'?
>>     #36: FILE: MAINTAINERS:14360:
>>     +M:     Clément Léger <clement.leger@bootlin.com>
>>                 ^^^^
>>
>> Use a utf-8 version of $rawline for spell checking.
>>
>> Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
>> Reported-by: Clément Le Goffic <clement.legoffic@foss.st.com>
>> ---
>> Changes in v2:
>> - use $rawline_utf8 also in the while-loop's body;
>> - fix path of checkpatch in the commit message.
>> ---
>>   scripts/checkpatch.pl | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> index 25fdb7fda112..2d122d232c6d 100755
>> --- a/scripts/checkpatch.pl
>> +++ b/scripts/checkpatch.pl
>> @@ -3477,9 +3477,10 @@ sub process {
>>   # Check for various typo / spelling mistakes
>>           if (defined($misspellings) &&
>>               ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
>> -            while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
>> +            my $rawline_utf8 = decode("utf8", $rawline);
>> +            while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
>>                   my $typo = $1;
>> -                my $blank = copy_spacing($rawline);
>> +                my $blank = copy_spacing($rawline_utf8);
>>                   my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
>>                   my $hereptr = "$hereline$ptr\n";
>>                   my $typo_fix = $spelling_fix{lc($typo)};
>>
>> base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86

Hi,

Is it just due to -ENOTIME for the maintainers, or are there doubts
about this patch? (inspired by a response from Uwe).

Clément