We do check UTF-8 correctness in checkpatch.pl (search for "patch and
commit message should be encoded in UTF-8"), but we count bytes, not
symbols when limiting line-length. Let's be consistent.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---
 scripts/checkpatch.pl | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 2189db19f54..711539bdd7c 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -7,6 +7,7 @@

 use strict;
 use warnings;
+use Encode qw(decode);
 use Term::ANSIColor qw(:constants);

 my $P = $0;
@@ -596,7 +597,11 @@ sub line_stats {
 	# Pick the indent from the front of the line.
 	my ($white) = ($line =~ /^(\s*)/);

-	return (length($line), length($white));
+	# Use character count (not byte count) so multi-byte UTF-8 characters
+	# are counted as single characters.
+	my $line_chars = length(decode('UTF-8', $line, Encode::FB_DEFAULT));
+	my $white_chars = length(decode('UTF-8', $white, Encode::FB_DEFAULT));
+	return ($line_chars, $white_chars);
 }

 my $sanitise_quote = '';
--
2.52.0

On Mon, Jun 01, 2026 at 08:10:08PM +0800, Vladimir Sementsov-Ogievskiy wrote:
> We do check UTF-8 correctness in checkpatch.pl (search for "patch and
> commit message should be encoded in UTF-8"), but we count bytes, not
> symbols when limiting line-length. Let's be consistent.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> ---
> scripts/checkpatch.pl | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 2189db19f54..711539bdd7c 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -7,6 +7,7 @@
>
> use strict;
> use warnings;
> +use Encode qw(decode);
> use Term::ANSIColor qw(:constants);
>
> my $P = $0;
> @@ -596,7 +597,11 @@ sub line_stats {
> # Pick the indent from the front of the line.
> my ($white) = ($line =~ /^(\s*)/);

line_stats() still expands tabs before decoding UTF-8, so tab stops are
computed from UTF-8 bytes rather than decoded characters.

For example, `é<TAB>X` is counted as length 8 with the current order, but
length 9 if decoded before tab expansion. This is mostly hidden for C
files because tabs are rejected, but `.s`/`.S` files may contain tabs.

Since this patch aims to make `line_stats()` use character semantics, it
should decode before expanding tabs, or make `expand_tabs()` operate on a
decoded string.

You could implement it like this:
```
@@ -588,8 +589,9 @@ sub line_stats {
 sub line_stats {
 	my ($line) = @_;

-	# Drop the diff line leader and expand tabs
+	# Drop the diff line leader, decode UTF-8, and expand tabs.
 	$line =~ s/^.//;
+	$line = decode('UTF-8', $line, Encode::FB_DEFAULT);
 	$line = expand_tabs($line);

 	# Pick the indent from the front of the line.
```
>
> - return (length($line), length($white));

With that change, the original `return (length($line), length($white));`
can be kept; the extra `decode()` calls and new return below are not needed.
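
As a quick standalone check of the `é<TAB>X` numbers, here is a minimal sketch (not checkpatch.pl code). It assumes 8-column tab stops, and `expand_tabs_sketch()` is a hypothetical, simplified stand-in for the real `expand_tabs()`:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical, simplified stand-in for checkpatch.pl's expand_tabs():
# pad each tab to the next multiple of 8, advancing one column per string
# element (a byte for an undecoded string, a character for a decoded one).
sub expand_tabs_sketch {
    my ($str) = @_;
    my $res = '';
    my $n = 0;
    for my $c (split(//, $str)) {
        if ($c eq "\t") {
            my $pad = 8 - ($n % 8);
            $res .= ' ' x $pad;
            $n += $pad;
            next;
        }
        $res .= $c;
        $n++;
    }
    return $res;
}

# "é" written as its two UTF-8 bytes, followed by a tab and "X".
my $raw = "\xC3\xA9\tX";

# Order in the posted patch: expand tabs on bytes, decode afterwards.
my $expand_then_decode =
    length(decode('UTF-8', expand_tabs_sketch($raw), Encode::FB_DEFAULT));

# Order suggested above: decode first, then expand tabs on characters.
my $decode_then_expand =
    length(expand_tabs_sketch(decode('UTF-8', $raw, Encode::FB_DEFAULT)));

print "raw bytes:          ", length($raw), "\n";
print "expand then decode: $expand_then_decode\n";
print "decode then expand: $decode_then_expand\n";
```

Run as a plain Perl script, this should print 4 for the raw byte length, 8 for expand-then-decode, and 9 for decode-then-expand.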
Thanks,
Chao

> + # Use character count (not byte count) so multi-byte UTF-8 characters
> + # are counted as single characters.
> + my $line_chars = length(decode('UTF-8', $line, Encode::FB_DEFAULT));
> + my $white_chars = length(decode('UTF-8', $white, Encode::FB_DEFAULT));
> + return ($line_chars, $white_chars);
> }
>
> my $sanitise_quote = '';
> --
> 2.52.0
>
On 02.06.26 16:16, Chao Liu wrote:
> On Mon, Jun 01, 2026 at 08:10:08PM +0800, Vladimir Sementsov-Ogievskiy wrote:
>> We do check UTF-8 correctness in checkpatch.pl (search for "patch and
>> commit message should be encoded in UTF-8"), but we count bytes, not
>> symbols when limiting line-length. Let's be consistent.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>> ---
>> scripts/checkpatch.pl | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> index 2189db19f54..711539bdd7c 100755
>> --- a/scripts/checkpatch.pl
>> +++ b/scripts/checkpatch.pl
>> @@ -7,6 +7,7 @@
>>
>> use strict;
>> use warnings;
>> +use Encode qw(decode);
>> use Term::ANSIColor qw(:constants);
>>
>> my $P = $0;
>> @@ -596,7 +597,11 @@ sub line_stats {
>> # Pick the indent from the front of the line.
>> my ($white) = ($line =~ /^(\s*)/);
> line_stats() still expands tabs before decoding UTF-8, so tab stops are
> computed from UTF-8 bytes rather than decoded characters.
>
> For example, `é<TAB>X` is counted as length 8 with the current order, but
> length 9 if decoded before tab expansion. This is mostly hidden for C
> files because tabs are rejected, but `.s`/`.S` files may contain tabs.
>
> Since this patch aims to make `line_stats()` use character semantics, it
> should decode before expanding tabs, or make `expand_tabs()` operate on a
> decoded string.
>
> You could implement it like this:
>
> ```
> @@ -588,8 +589,9 @@ sub line_stats {
> sub line_stats {
> my ($line) = @_;
>
> - # Drop the diff line leader and expand tabs
> + # Drop the diff line leader, decode UTF-8, and expand tabs.
> $line =~ s/^.//;
> + $line = decode('UTF-8', $line, Encode::FB_DEFAULT);
> $line = expand_tabs($line);
>
> # Pick the indent from the front of the line.
> ```
>>
>> - return (length($line), length($white));
> With that change, the original `return (length($line), length($white));`
> can be kept; the extra `decode()` calls and new return below are not needed.
>
Thanks a lot, I'll resend.

--
Best regards,
Vladimir