[v3] uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"

[PATCH v3 RESEND] uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"

Posted by I Hsin Cheng 11 months, 2 weeks ago

This patch reverts 'commit c32ee3d9abd2("bitops: avoid integer overflow in
 GENMASK(_ULL)")'.

The code generation can be shrink by over 1KB by reverting this commit.
Originally the commit claimed that clang would emit warnings using the
implementation at that time.

The patch was applied and tested against numerous compilers, including
gcc-13, gcc-12, gcc-11 cross-compiler, clang-17, clang-18 and clang-19.
Various warning levels were set (-W=0, -W=1, -W=2) and CONFIG_WERROR
disabled to complete the compilation. The results show that no compilation
errors or warnings were generated due to the patch.

The results of code size reduction are summarized in the following table.
The code size changes for clang are all zero across different versions,
so they're not listed in the table.

For NR_CPUS=64 on x86_64.
----------------------------------------------
|	        |   gcc-13 |   gcc-12 |   gcc-11 |
----------------------------------------------
|       old | 22438085 | 22453915 | 22302033 |
----------------------------------------------
|       new | 22436816 | 22452913 | 22300826 |
----------------------------------------------
| new - old |    -1269 |    -1002 |    -1207 |
----------------------------------------------

For NR_CPUS=1024 on x86_64.
----------------------------------------------
|	        |   gcc-13 |   gcc-12 |   gcc-11 |
----------------------------------------------
|       old | 22493682 | 22509812 | 22357661 |
----------------------------------------------
|       new | 22493230 | 22509487 | 22357250 |
----------------------------------------------
| new - old |     -452 |     -325 |     -411 |
----------------------------------------------

For arm64 architecture, gcc cross-compiler was used and QEMU was
utilized to execute a VM for a CPU-heavy workload to ensure no
side effects and that functionalities remained correct. The test
even demonstrated a positive result in terms of code size reduction:
* Before: 31660668
* After: 31658724
* Difference (After - Before): -1944

An analysis of multiple functions compiled with gcc-13 on x86_64 was
performed. In summary, the patch elimates one negation in almost
every use case. However, negative effects may occur in some cases,
such as the generation of additional "mov" instruction or increased
register usage. The use of "~_UL(0) << (l)" may even result in the
allocations of "%r*" registers instead of "%e*" registers (which are
32-bit registers) because the compiler cannot assume that the higher
bits are zero.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
---
Changelog:

v1 -> v2:
	- Refer the patch explicitly as a revert of commit c32ee3d
	- Squash commits into 1 commit
	- Perform compilation for numerous compilers, architectures and
	  compare code size shrink for each of them
	- Perform cpu-heavy workload test inside x86_64 VM and ARM64 VM
	- Analyze the result of disassembly of several small use cases

v2 -> v3:
	- Refactor code to single line
	- Disabled CONFIG_WERROR to ensure no additional compilation errors or warnings
	- Remove description for unrelated compilation errors and warnings
	- Summarize the result into a better looking table
	- Rephrase the patch and fix typos


Hi Yury,

Sorry about the last iteration that I included everything, making
the email too large and difficult to read. However, you still reviewed
it and gave me feedbacks, really big thanks for your patience and those
feedbacks. Running these tests also gave me a great opportunity to learn
a lot.

If there's anything else needed to be test or modified, please let
me know, I'll ammend them as soon as possible.

Hi David,

Thanks for your advise on alternative ideas for this code. I ran some
simple test (basically check the result of code size reduction) based
on your suggestions.

For gcc-13 on x86_64 + defconfig.
* "(_UL(2) << (h)) - (_UL(1) << (l))"
	Before=22438085, After=22438193, Difference ( After - Before ) = 108
* "((_UL(1) + _UL(1)) << (h)) - (_UL(1) << (l))"
	Before=22438085, After=22438209, Difference ( After - Before ) = 124

I tried to do an analysis on "intel_arch_events_quirk()", it only +2 in
code size change, I think it would be a nice example to see the differences
in generated code.

So the result shows that your proposal can save 1 negation and 1 "and".
-ffffffff83278ad2:	48 f7 d8             	neg    %rax
-ffffffff83278adc:	83 e0 7f             	and    $0x7f,%eax

However, one more "mov" and one more "sub" are generated.
+ffffffff83278acf:	b8 80 00 00 00       	mov    $0x80,%eax
+ffffffff83278ad7:	48 29 d0             	sub    %rdx,%rax

No change in total number of instructions, but negation only requires
one register, and the "mov" generated is more expensive then usual.
(Usually "mov" of the following form are generated,
"48 89 ea             	mov    %rbp,%rdx",
"89 c3                	mov    %eax,%ebx" ).

Thanks for your advise again, in some scenario it does have positive
effect, but unfortunately, the overall impact is not beneficial.

Best regards,
I Hsin Cheng.

Tests performed on ubuntu 24.04 with AMD Ryzen 7 5700X3D 8-Core
Processor on x86_64 with kernel version v6.14.0-rc1.
---
 include/uapi/linux/bits.h | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
index 5ee30f882736..682b406e1067 100644
--- a/include/uapi/linux/bits.h
+++ b/include/uapi/linux/bits.h
@@ -4,13 +4,9 @@
 #ifndef _UAPI_LINUX_BITS_H
 #define _UAPI_LINUX_BITS_H
 
-#define __GENMASK(h, l) \
-        (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
-         (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
+#define __GENMASK(h, l) (((~_UL(0)) << (l)) & (~_UL(0) >> (BITS_PER_LONG - 1 - (h))))
 
-#define __GENMASK_ULL(h, l) \
-        (((~_ULL(0)) - (_ULL(1) << (l)) + 1) & \
-         (~_ULL(0) >> (__BITS_PER_LONG_LONG - 1 - (h))))
+#define __GENMASK_ULL(h, l) (((~_ULL(0)) << (l)) & (~_ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
 
 #define __GENMASK_U128(h, l) \
 	((_BIT128((h)) << 1) - (_BIT128(l)))
-- 
2.43.0

Re: [PATCH v3 RESEND] uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"

Posted by David Laight 11 months, 1 week ago

On Wed, 26 Feb 2025 14:56:23 +0800
I Hsin Cheng <richard120310@gmail.com> wrote:

> This patch reverts 'commit c32ee3d9abd2("bitops: avoid integer overflow in
>  GENMASK(_ULL)")'.
> 
> The code generation can be shrink by over 1KB by reverting this commit.
> Originally the commit claimed that clang would emit warnings using the
> implementation at that time.
...
> The results of code size reduction are summarized in the following table.
> The code size changes for clang are all zero across different versions,
> so they're not listed in the table.

I've been looking at the object changes.

I think all the big differences are due to the compiler changing the 'inline'
decision for some functions.
This is most obvious when the functions bloat-o-meter generates differ.
But I've seen odd things like seeing an inlined (IIRC) pud_val() immediately
followed by call to the wrapper (which is just an indirect call).
(The kernel I'm building is for the machine I'm building it on and has pretty
much all the mitigations compiled out for speed.)

I suspect the +1/-1 and extra negates all depend on subtleties in the compiler
and whether it is doing |= GENMASK() or &= ~GENMASK().
Some of the differences are also in for_each_set_bit() which used GENMASK()
with variables - plausibly it needs a specific implementation.

Sometimes you win, sometimes you lose.
But simplicity is a good win, and the current version is anything but.
(It is also broken for ASM because _UL(x) and _ULL(x) are both just (x)
so GENMASK() and GENMASK_ULL() can't both be right.)

	David

Re: [PATCH v3 RESEND] uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"

Posted by Yury Norov 11 months, 2 weeks ago

On Wed, Feb 26, 2025 at 02:56:23PM +0800, I Hsin Cheng wrote:
> This patch reverts 'commit c32ee3d9abd2("bitops: avoid integer overflow in
>  GENMASK(_ULL)")'.
> 
> The code generation can be shrink by over 1KB by reverting this commit.
> Originally the commit claimed that clang would emit warnings using the
> implementation at that time.
> 
> The patch was applied and tested against numerous compilers, including
> gcc-13, gcc-12, gcc-11 cross-compiler, clang-17, clang-18 and clang-19.
> Various warning levels were set (-W=0, -W=1, -W=2) and CONFIG_WERROR
> disabled to complete the compilation. The results show that no compilation
> errors or warnings were generated due to the patch.
> 
> The results of code size reduction are summarized in the following table.
> The code size changes for clang are all zero across different versions,
> so they're not listed in the table.
> 
> For NR_CPUS=64 on x86_64.
> ----------------------------------------------
> |	        |   gcc-13 |   gcc-12 |   gcc-11 |
> ----------------------------------------------
> |       old | 22438085 | 22453915 | 22302033 |
> ----------------------------------------------
> |       new | 22436816 | 22452913 | 22300826 |
> ----------------------------------------------
> | new - old |    -1269 |    -1002 |    -1207 |
> ----------------------------------------------
> 
> For NR_CPUS=1024 on x86_64.
> ----------------------------------------------
> |	        |   gcc-13 |   gcc-12 |   gcc-11 |
> ----------------------------------------------
> |       old | 22493682 | 22509812 | 22357661 |
> ----------------------------------------------
> |       new | 22493230 | 22509487 | 22357250 |
> ----------------------------------------------
> | new - old |     -452 |     -325 |     -411 |
> ----------------------------------------------
> 
> For arm64 architecture, gcc cross-compiler was used and QEMU was
> utilized to execute a VM for a CPU-heavy workload to ensure no
> side effects and that functionalities remained correct. The test
> even demonstrated a positive result in terms of code size reduction:
> * Before: 31660668
> * After: 31658724
> * Difference (After - Before): -1944
> 
> An analysis of multiple functions compiled with gcc-13 on x86_64 was
> performed. In summary, the patch elimates one negation in almost
> every use case. However, negative effects may occur in some cases,
> such as the generation of additional "mov" instruction or increased
> register usage. The use of "~_UL(0) << (l)" may even result in the
> allocations of "%r*" registers instead of "%e*" registers (which are
> 32-bit registers) because the compiler cannot assume that the higher
> bits are zero.
> 
> Signed-off-by: I Hsin Cheng <richard120310@gmail.com>

Applied in bitmap-for-next. Thanks for the work!