checkpatch: calculate UTF-8 characters correctly

[PATCH] checkpatch: calculate UTF-8 characters correctly

Posted by Vladimir Sementsov-Ogievskiy 6 days, 11 hours ago

We do check UTF-8 correctness in checkpatch.pl (search for "patch and
commit message should be encoded in UTF-8"), but we count bytes, not
symbols when limiting line-length. Let's be consistent.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---
 scripts/checkpatch.pl | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 2189db19f54..711539bdd7c 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -7,6 +7,7 @@
 
 use strict;
 use warnings;
+use Encode qw(decode);
 use Term::ANSIColor qw(:constants);
 
 my $P = $0;
@@ -596,7 +597,11 @@ sub line_stats {
 	# Pick the indent from the front of the line.
 	my ($white) = ($line =~ /^(\s*)/);
 
-	return (length($line), length($white));
+	# Use character count (not byte count) so multi-byte UTF-8 characters
+	# are counted as single characters.
+	my $line_chars  = length(decode('UTF-8', $line,  Encode::FB_DEFAULT));
+	my $white_chars = length(decode('UTF-8', $white, Encode::FB_DEFAULT));
+	return ($line_chars, $white_chars);
 }
 
 my $sanitise_quote = '';
-- 
2.52.0

Re: [PATCH] checkpatch: calculate UTF-8 characters correctly

Posted by Chao Liu 5 days, 15 hours ago

On Mon, Jun 01, 2026 at 08:10:08PM +0800, Vladimir Sementsov-Ogievskiy wrote:
> We do check UTF-8 correctness in checkpatch.pl (search for "patch and
> commit message should be encoded in UTF-8"), but we count bytes, not
> symbols when limiting line-length. Let's be consistent.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> ---
>  scripts/checkpatch.pl | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 2189db19f54..711539bdd7c 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -7,6 +7,7 @@
>  
>  use strict;
>  use warnings;
> +use Encode qw(decode);
>  use Term::ANSIColor qw(:constants);
>  
>  my $P = $0;
> @@ -596,7 +597,11 @@ sub line_stats {
>  	# Pick the indent from the front of the line.
>  	my ($white) = ($line =~ /^(\s*)/);
line_stats() still expands tabs before decoding UTF-8, so tab stops are
computed from UTF-8 bytes rather than decoded characters.

For example, `é<TAB>X` is counted as length 8 with the current order, but
length 9 if decoded before tab expansion. This is mostly hidden for C
files because tabs are rejected, but `.s`/`.S` files may contain tabs.
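
The numbers above can be reproduced with a small standalone script. This is
only a sketch: the `expand_tabs()` below is a simplified stand-in written for
this example, assuming 8-column tab stops and one counter step per element of
the string, not a verbatim copy of the helper in scripts/checkpatch.pl. The
variable names are likewise made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Simplified stand-in for expand_tabs() (assumption: 8-column tab stops,
# the column counter advances once per element of the string, i.e. once
# per byte for an undecoded string and once per character for a decoded one).
sub expand_tabs {
	my ($str) = @_;
	my $res = '';
	my $n = 0;
	for my $c (split(//, $str)) {
		if ($c eq "\t") {
			# Pad with spaces up to the next multiple of 8.
			do {
				$res .= ' ';
				$n++;
			} while ($n % 8 != 0);
			next;
		}
		$res .= $c;
		$n++;
	}
	return $res;
}

# Raw bytes as they would be read from a patch: "é" is 0xC3 0xA9 in UTF-8.
my $bytes = "\xC3\xA9\tX";

# Order in the posted patch: expand tabs on bytes, then decode and count.
my $expand_then_decode =
	length(decode('UTF-8', expand_tabs($bytes), Encode::FB_DEFAULT));

# Suggested order: decode first, then expand tabs on characters.
my $decode_then_expand =
	length(expand_tabs(decode('UTF-8', $bytes, Encode::FB_DEFAULT)));

print "expand then decode: $expand_then_decode\n";   # prints 8
print "decode then expand: $decode_then_expand\n";   # prints 9
```

In the first case the two bytes of `é` advance the column counter to 2, so the
tab is padded with only six spaces; after decoding, the line is 1 + 6 + 1 = 8
characters. In the second case `é` occupies one column, the tab is padded with
seven spaces, and the line is 1 + 7 + 1 = 9 characters.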

Since this patch aims to make `line_stats()` use character semantics, it
should decode before expanding tabs, or make `expand_tabs()` operate on a
decoded string.

You could implement it like this:

```
   @@ -588,8 +589,9 @@ sub line_stats {
    sub line_stats {
       my ($line) = @_;

   -   # Drop the diff line leader and expand tabs
   +   # Drop the diff line leader, decode UTF-8, and expand tabs.
       $line =~ s/^.//;
   +   $line = decode('UTF-8', $line, Encode::FB_DEFAULT);
       $line = expand_tabs($line);

       # Pick the indent from the front of the line.
```
>  
> -	return (length($line), length($white));
With that change, the original `return (length($line), length($white));`
can be kept; the extra `decode()` calls and new return below are not needed.

Thanks,
Chao

> +	# Use character count (not byte count) so multi-byte UTF-8 characters
> +	# are counted as single characters.
> +	my $line_chars  = length(decode('UTF-8', $line,  Encode::FB_DEFAULT));
> +	my $white_chars = length(decode('UTF-8', $white, Encode::FB_DEFAULT));
> +	return ($line_chars, $white_chars);
>  }
>  
>  my $sanitise_quote = '';
> -- 
> 2.52.0
>

Re: [PATCH] checkpatch: calculate UTF-8 characters correctly

Posted by Vladimir Sementsov-Ogievskiy 5 days, 14 hours ago

On 02.06.26 16:16, Chao Liu wrote:
> On Mon, Jun 01, 2026 at 08:10:08PM +0800, Vladimir Sementsov-Ogievskiy wrote:
>> We do check UTF-8 correctness in checkpatch.pl (search for "patch and
>> commit message should be encoded in UTF-8"), but we count bytes, not
>> symbols when limiting line-length. Let's be consistent.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>> ---
>>   scripts/checkpatch.pl | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> index 2189db19f54..711539bdd7c 100755
>> --- a/scripts/checkpatch.pl
>> +++ b/scripts/checkpatch.pl
>> @@ -7,6 +7,7 @@
>>   
>>   use strict;
>>   use warnings;
>> +use Encode qw(decode);
>>   use Term::ANSIColor qw(:constants);
>>   
>>   my $P = $0;
>> @@ -596,7 +597,11 @@ sub line_stats {
>>   	# Pick the indent from the front of the line.
>>   	my ($white) = ($line =~ /^(\s*)/);
> line_stats() still expands tabs before decoding UTF-8, so tab stops are
> computed from UTF-8 bytes rather than decoded characters.
> 
> For example, `é<TAB>X` is counted as length 8 with the current order, but
> length 9 if decoded before tab expansion. This is mostly hidden for C
> files because tabs are rejected, but `.s`/`.S` files may contain tabs.
> 
> Since this patch aims to make `line_stats()` use character semantics, it
> should decode before expanding tabs, or make `expand_tabs()` operate on a
> decoded string.
> 
> You could implement it like this:
> 
> ```
>     @@ -588,8 +589,9 @@ sub line_stats {
>      sub line_stats {
>         my ($line) = @_;
> 
>     -   # Drop the diff line leader and expand tabs
>     +   # Drop the diff line leader, decode UTF-8, and expand tabs.
>         $line =~ s/^.//;
>     +   $line = decode('UTF-8', $line, Encode::FB_DEFAULT);
>         $line = expand_tabs($line);
> 
>         # Pick the indent from the front of the line.
> ```
>>   
>> -	return (length($line), length($white));
> With that change, the original `return (length($line), length($white));`
> can be kept; the extra `decode()` calls and new return below are not needed.
> 

Thanks a lot, I'll resend.


-- 
Best regards,
Vladimir