[v1] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

[PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

Posted by Vincent Mailhol 3 years, 12 months ago

The compilers provides some builtin expression equivalent to the
ffs(), __ffs() and ffz() function of the kernel. The kernel uses
optimized assembly which produces better code than the builtin
functions. However, such assembly code can not be optimized when used
on constant expression.

This series relies on __builtin_constant_p to select the optimal solution:

  * use kernel assembly for non constant expressions

  * use compiler's __builtin function for constant expressions.

I also think that the fls() and fls64() can be optimized in a similar
way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
trivial so I want to focus on this series first. If it get accepted, I
will then work on those two additionnal function.


** Statistics **

On a allyesconfig, before applying this series, I get:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 1081

After applying this series:

| $ objdump -d vmlinux.o | grep bsf | wc -l
| 792

So, roughly 26.7% of the call to either ffs() or __ffs() were using
constant expression and can be optimized (I did not produce the
figures for ffz()).

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)


Vincent Mailhol (2):
  x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
    expressions
  x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
    expressions

 arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 25 deletions(-)

-- 
2.35.1

Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

Posted by Nick Desaulniers 3 years, 12 months ago

On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> The compilers provides some builtin expression equivalent to the
> ffs(), __ffs() and ffz() function of the kernel. The kernel uses
> optimized assembly which produces better code than the builtin
> functions. However, such assembly code can not be optimized when used
> on constant expression.
>
> This series relies on __builtin_constant_p to select the optimal solution:
>
>   * use kernel assembly for non constant expressions
>
>   * use compiler's __builtin function for constant expressions.
>
> I also think that the fls() and fls64() can be optimized in a similar
> way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
> trivial so I want to focus on this series first. If it get accepted, I
> will then work on those two additionnal function.
>
>
> ** Statistics **
>
> On a allyesconfig, before applying this series, I get:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 1081
>
> After applying this series:
>
> | $ objdump -d vmlinux.o | grep bsf | wc -l
> | 792
>
> So, roughly 26.7% of the call to either ffs() or __ffs() were using
> constant expression and can be optimized (I did not produce the
> figures for ffz()).

These stats are interesting; consider putting them on patch 1/2 commit
message though (in addition to the cover letter). (Sending thoughts on
1/2 next).

>
> (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Here's the same measure of x86_64 allyesconfig (./scripts/config -d
CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
LLVM (~clang-15):

Before:
$ objdump -d vmlinux.o | grep bsf | wc -l
1454

After:
$ objdump -d vmlinux.o | grep bsf | wc -l
1070

-26.4% :)


>
>
> Vincent Mailhol (2):
>   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
>     expressions
>   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
>     expressions
>
>  arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
>  1 file changed, 40 insertions(+), 25 deletions(-)
>
> --
> 2.35.1
>


--
Thanks,
~Nick Desaulniers

Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

Posted by Vincent MAILHOL 3 years, 12 months ago

On Wed. 11 May 2022 at 07:14, Nick Desaulniers <ndesaulniers@google.com> wrote:
> On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> <mailhol.vincent@wanadoo.fr> wrote:
> >
> > The compilers provides some builtin expression equivalent to the
> > ffs(), __ffs() and ffz() function of the kernel. The kernel uses
> > optimized assembly which produces better code than the builtin
> > functions. However, such assembly code can not be optimized when used
> > on constant expression.
> >
> > This series relies on __builtin_constant_p to select the optimal solution:
> >
> >   * use kernel assembly for non constant expressions
> >
> >   * use compiler's __builtin function for constant expressions.
> >
> > I also think that the fls() and fls64() can be optimized in a similar
> > way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
> > trivial so I want to focus on this series first. If it get accepted, I
> > will then work on those two additionnal function.
> >
> >
> > ** Statistics **
> >
> > On a allyesconfig, before applying this series, I get:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 1081
> >
> > After applying this series:
> >
> > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > | 792
> >
> > So, roughly 26.7% of the call to either ffs() or __ffs() were using
> > constant expression and can be optimized (I did not produce the
> > figures for ffz()).
>
> These stats are interesting; consider putting them on patch 1/2 commit
> message though (in addition to the cover letter). (Sending thoughts on
> 1/2 next).

The fact is that patch 1/2 changes ffs() and patch 2/2 changes
__ffs(). For v2, I will run the stats on each patch separately in
order not to mix the results.

> >
> > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
>
> Here's the same measure of x86_64 allyesconfig (./scripts/config -d
> CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
> LLVM (~clang-15):
>
> Before:
> $ objdump -d vmlinux.o | grep bsf | wc -l
> 1454
>
> After:
> $ objdump -d vmlinux.o | grep bsf | wc -l
> 1070
>
> -26.4% :)

Roughly same ratio. I am just surprise that the absolute number
are different:

  * GCC before: 1081, after 792
  * clang before 1454, after 1070

I wonder why clang produces more bsf instructions than GCC?

Also, on a side note, I am not the first one to realize that
__builtin_ffs() is able to optimize the constant variable. Some
people already used it to locally:

| $ git grep __builtin_ffs | wc -l
| 80

> >
> >
> > Vincent Mailhol (2):
> >   x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant
> >     expressions
> >   x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant
> >     expressions
> >
> >  arch/x86/include/asm/bitops.h | 65 +++++++++++++++++++++--------------
> >  1 file changed, 40 insertions(+), 25 deletions(-)
> >
> > --
> > 2.35.1
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers

Re: [PATCH 0/2] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

Posted by Vincent MAILHOL 3 years, 12 months ago

On Wed. 11 mai 2022 at 08:24, Vincent MAILHOL
<mailhol.vincent@wanadoo.fr> wrote:
> On Wed. 11 May 2022 at 07:14, Nick Desaulniers <ndesaulniers@google.com> wrote:
> > On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
> > <mailhol.vincent@wanadoo.fr> wrote:
> > >
> > > The compilers provides some builtin expression equivalent to the
> > > ffs(), __ffs() and ffz() function of the kernel. The kernel uses
> > > optimized assembly which produces better code than the builtin
> > > functions. However, such assembly code can not be optimized when used
> > > on constant expression.
> > >
> > > This series relies on __builtin_constant_p to select the optimal solution:
> > >
> > >   * use kernel assembly for non constant expressions
> > >
> > >   * use compiler's __builtin function for constant expressions.
> > >
> > > I also think that the fls() and fls64() can be optimized in a similar
> > > way, using __builtin_ctz() and __builtin_ctzll() but it is a bit less
> > > trivial so I want to focus on this series first. If it get accepted, I
> > > will then work on those two additionnal function.
> > >
> > >
> > > ** Statistics **
> > >
> > > On a allyesconfig, before applying this series, I get:
> > >
> > > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > > | 1081
> > >
> > > After applying this series:
> > >
> > > | $ objdump -d vmlinux.o | grep bsf | wc -l
> > > | 792
> > >
> > > So, roughly 26.7% of the call to either ffs() or __ffs() were using
> > > constant expression and can be optimized (I did not produce the
> > > figures for ffz()).
> >
> > These stats are interesting; consider putting them on patch 1/2 commit
> > message though (in addition to the cover letter). (Sending thoughts on
> > 1/2 next).
>
> The fact is that patch 1/2 changes ffs() and patch 2/2 changes
> __ffs(). For v2, I will run the stats on each patch separately in
> order not to mix the results.
>
> > >
> > > (tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
> >
> > Here's the same measure of x86_64 allyesconfig (./scripts/config -d
> > CONFIG_HINIC) at 9be9ed2612b5aedb52a2c240edb1630b6b743cb6 with ToT
> > LLVM (~clang-15):
> >
> > Before:
> > $ objdump -d vmlinux.o | grep bsf | wc -l
> > 1454
> >
> > After:
> > $ objdump -d vmlinux.o | grep bsf | wc -l
> > 1070
> >
> > -26.4% :)
>
> Roughly same ratio. I am just surprise that the absolute number
> are different:
>
>   * GCC before: 1081, after 792
>   * clang before 1454, after 1070
>
> I wonder why clang produces more bsf instructions than GCC?

Did not find the answer yet, but while looking at this, I found
another interesting thing: on x86_64 the bsf instruction produces
tzcnt when used with the ret prefix. So ffs() produces bsf assembly
instructions but __ffs() and ffz() produces tzcnt. c.f.

http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

I will update the figures in v2 and benchmark both bsf and tzcnt.

> Also, on a side note, I am not the first one to realize that
> __builtin_ffs() is able to optimize the constant variable. Some
> people already used it to locally:
>
> | $ git grep __builtin_ffs | wc -l
> | 80