The compiler is not able to do constant folding on "asm volatile" code.
Evaluate whether or not the function argument is a constant expression
and if this is the case, return an equivalent builtin expression.
On Linux 6.7 with an allyesconfig and GCC 13.2.1, it saves roughly 11 KB.
$ size --format=GNU vmlinux.before vmlinux.after
text data bss total filename
60457964 70953697 2288644 133700305 vmlinux.before
60441196 70957057 2290724 133688977 vmlinux.after
Reference: commit fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions")
Link: https://git.kernel.org/torvalds/c/fdb6649ab7c1
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
arch/m68k/include/asm/bitops.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
index a8b23f897f24..02ec8a193b96 100644
--- a/arch/m68k/include/asm/bitops.h
+++ b/arch/m68k/include/asm/bitops.h
@@ -469,6 +469,9 @@ static __always_inline unsigned long ffz(unsigned long word)
{
int res;
+ if (__builtin_constant_p(word))
+ return __builtin_ctzl(~word);
+
__asm__ __volatile__ ("bfffo %1{#0,#0},%0"
: "=d" (res) : "d" (~word & -~word));
return res ^ 31;
@@ -490,6 +493,9 @@ static __always_inline unsigned long ffz(unsigned long word)
!defined(CONFIG_M68000)
static __always_inline unsigned long __ffs(unsigned long x)
{
+ if (__builtin_constant_p(x))
+ return __builtin_ctzl(x);
+
__asm__ __volatile__ ("bitrev %0; ff1 %0"
: "=d" (x)
: "0" (x));
@@ -522,6 +528,9 @@ static __always_inline int ffs(int x)
{
int cnt;
+ if (__builtin_constant_p(x))
+ return __builtin_ffs(x);
+
__asm__ ("bfffo %1{#0:#0},%0"
: "=d" (cnt)
: "dm" (x & -x));
@@ -540,6 +549,9 @@ static __always_inline int fls(unsigned int x)
{
int cnt;
+ if (__builtin_constant_p(x))
+ return x ? BITS_PER_TYPE(x) - __builtin_clz(x) : 0;
+
__asm__ ("bfffo %1{#0,#0},%0"
: "=d" (cnt)
: "dm" (x));
--
2.43.0
On Sun, 28 Jan 2024, Vincent Mailhol wrote:
> The compiler is not able to do constant folding on "asm volatile" code.
>
> Evaluate whether or not the function argument is a constant expression
> and if this is the case, return an equivalent builtin expression.
>
> On linux 6.7 with an allyesconfig and GCC 13.2.1, it saves roughly 11 KB.
>
> $ size --format=GNU vmlinux.before vmlinux.after
> text data bss total filename
> 60457964 70953697 2288644 133700305 vmlinux.before
> 60441196 70957057 2290724 133688977 vmlinux.after
>
> Reference: commit fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions")
> Link: https://git.kernel.org/torvalds/c/fdb6649ab7c1
>
> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
> ---
> arch/m68k/include/asm/bitops.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
> index a8b23f897f24..02ec8a193b96 100644
> --- a/arch/m68k/include/asm/bitops.h
> +++ b/arch/m68k/include/asm/bitops.h
> @@ -469,6 +469,9 @@ static __always_inline unsigned long ffz(unsigned long word)
> {
> int res;
>
> + if (__builtin_constant_p(word))
> + return __builtin_ctzl(~word);
> +
> __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
> : "=d" (res) : "d" (~word & -~word));
> return res ^ 31;
If the builtin has the desired behaviour, why do we reimplement it in asm?
Shouldn't we abandon one or the other to avoid having to prove (and
maintain) their equivalence?
> @@ -490,6 +493,9 @@ static __always_inline unsigned long ffz(unsigned long word)
> !defined(CONFIG_M68000)
> static __always_inline unsigned long __ffs(unsigned long x)
> {
> + if (__builtin_constant_p(x))
> + return __builtin_ctzl(x);
> +
> __asm__ __volatile__ ("bitrev %0; ff1 %0"
> : "=d" (x)
> : "0" (x));
> @@ -522,6 +528,9 @@ static __always_inline int ffs(int x)
> {
> int cnt;
>
> + if (__builtin_constant_p(x))
> + return __builtin_ffs(x);
> +
> __asm__ ("bfffo %1{#0:#0},%0"
> : "=d" (cnt)
> : "dm" (x & -x));
> @@ -540,6 +549,9 @@ static __always_inline int fls(unsigned int x)
> {
> int cnt;
>
> + if (__builtin_constant_p(x))
> + return x ? BITS_PER_TYPE(x) - __builtin_clz(x) : 0;
> +
> __asm__ ("bfffo %1{#0,#0},%0"
> : "=d" (cnt)
> : "dm" (x));
>
On Sun. 28 Jan. 2024 at 14:39, Finn Thain <fthain@linux-m68k.org> wrote:
> On Sun, 28 Jan 2024, Vincent Mailhol wrote:
>
> > The compiler is not able to do constant folding on "asm volatile" code.
> >
> > Evaluate whether or not the function argument is a constant expression
> > and if this is the case, return an equivalent builtin expression.
> >
> > On linux 6.7 with an allyesconfig and GCC 13.2.1, it saves roughly 11 KB.
> >
> > $ size --format=GNU vmlinux.before vmlinux.after
> > text data bss total filename
> > 60457964 70953697 2288644 133700305 vmlinux.before
> > 60441196 70957057 2290724 133688977 vmlinux.after
> >
> > Reference: commit fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions")
> > Link: https://git.kernel.org/torvalds/c/fdb6649ab7c1
> >
> > Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
> > Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
> > ---
> > arch/m68k/include/asm/bitops.h | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
> > index a8b23f897f24..02ec8a193b96 100644
> > --- a/arch/m68k/include/asm/bitops.h
> > +++ b/arch/m68k/include/asm/bitops.h
> > @@ -469,6 +469,9 @@ static __always_inline unsigned long ffz(unsigned long word)
> > {
> > int res;
> >
> > + if (__builtin_constant_p(word))
> > + return __builtin_ctzl(~word);
> > +
> > __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
> > : "=d" (res) : "d" (~word & -~word));
> > return res ^ 31;
>
> If the builtin has the desired behaviour, why do we reimplement it in asm?
> Shouldn't we abandon one or the other to avoid having to prove (and
> maintain) their equivalence?
The asm is meant to produce better results when the argument is not a
constant expression. The commit below is a good illustration of why we
want both the asm and the builtin:
https://git.kernel.org/torvalds/c/146034fed6ee
I say "is meant" because I did not verify whether this is still true.
Note that there are some cases in which the asm is not better anymore,
for example, see this thread:
https://lore.kernel.org/lkml/20221106095106.849154-2-mailhol.vincent@wanadoo.fr/
but I did not receive more answers, so I stopped trying to investigate
the subject.
If you want, you can check the produced assembly of both the asm and
the builtin for both clang and gcc, and if the builtin is always
either better or equivalent, then the asm can be removed. That said, I
am not spending more effort there after being ghosted once (cf. above
thread).
> > @@ -490,6 +493,9 @@ static __always_inline unsigned long ffz(unsigned long word)
> > !defined(CONFIG_M68000)
> > static __always_inline unsigned long __ffs(unsigned long x)
> > {
> > + if (__builtin_constant_p(x))
> > + return __builtin_ctzl(x);
> > +
> > __asm__ __volatile__ ("bitrev %0; ff1 %0"
> > : "=d" (x)
> > : "0" (x));
> > @@ -522,6 +528,9 @@ static __always_inline int ffs(int x)
> > {
> > int cnt;
> >
> > + if (__builtin_constant_p(x))
> > + return __builtin_ffs(x);
> > +
> > __asm__ ("bfffo %1{#0:#0},%0"
> > : "=d" (cnt)
> > : "dm" (x & -x));
> > @@ -540,6 +549,9 @@ static __always_inline int fls(unsigned int x)
> > {
> > int cnt;
> >
> > + if (__builtin_constant_p(x))
> > + return x ? BITS_PER_TYPE(x) - __builtin_clz(x) : 0;
> > +
> > __asm__ ("bfffo %1{#0,#0},%0"
> > : "=d" (cnt)
> > : "dm" (x));
> >
From: Vincent MAILHOL
> Sent: 28 January 2024 06:27
>
> On Sun. 28 Jan. 2024 at 14:39, Finn Thain <fthain@linux-m68k.org> wrote:
> > On Sun, 28 Jan 2024, Vincent Mailhol wrote:
> >
> > > The compiler is not able to do constant folding on "asm volatile" code.
> > >
> > > Evaluate whether or not the function argument is a constant expression
> > > and if this is the case, return an equivalent builtin expression.
> > >
...
> > If the builtin has the desired behaviour, why do we reimplement it in asm?
> > Shouldn't we abandon one or the other to avoid having to prove (and
> > maintain) their equivalence?
>
> The asm is meant to produce better results when the argument is not a
> constant expression. The commit below is a good illustration of why we
> want both the asm and the builtin:
>
> https://git.kernel.org/torvalds/c/146034fed6ee
>
> I say "is meant" because I did not verify whether this is still true.
> Note that there are some cases in which the asm is not better anymore,
> for example, see this thread:
>
> https://lore.kernel.org/lkml/20221106095106.849154-2-mailhol.vincent@wanadoo.fr/
>
> but I did not receive more answers, so I stopped trying to investigate
> the subject.
>
> If you want, you can check the produced assembly of both the asm and
> the builtin for both clang and gcc, and if the builtin is always
> either better or equivalent, then the asm can be removed. That said, I
> am not spending more effort there after being ghosted once (cf. above
> thread).

I don't see any example there of why the __builtin_xxx() versions
shouldn't be used all the time.
(The x86-64 asm blocks contain unrelated call instructions and objdump
wasn't passed -d to show what they were.
One even has the 'return thunk' pessimisation showing.)

I actually suspect the asm versions predate the builtins.

Does (or can) the outer common header use the __builtin functions
if no asm version exists?
	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Sun. 28 Jan. 2024 at 21:16, David Laight <David.Laight@aculab.com> wrote:
> From: Vincent MAILHOL
> > Sent: 28 January 2024 06:27
> >
> > On Sun. 28 Jan. 2024 at 14:39, Finn Thain <fthain@linux-m68k.org> wrote:
> > > On Sun, 28 Jan 2024, Vincent Mailhol wrote:
> > >
> > > > The compiler is not able to do constant folding on "asm volatile" code.
> > > >
> > > > Evaluate whether or not the function argument is a constant expression
> > > > and if this is the case, return an equivalent builtin expression.
> > > >
> ...
> > > If the builtin has the desired behaviour, why do we reimplement it in asm?
> > > Shouldn't we abandon one or the other to avoid having to prove (and
> > > maintain) their equivalence?
> >
> > The asm is meant to produce better results when the argument is not a
> > constant expression. The commit below is a good illustration of why we
> > want both the asm and the builtin:
> >
> > https://git.kernel.org/torvalds/c/146034fed6ee
> >
> > I say "is meant" because I did not verify whether this is still true.
> > Note that there are some cases in which the asm is not better anymore,
> > for example, see this thread:
> >
> > https://lore.kernel.org/lkml/20221106095106.849154-2-mailhol.vincent@wanadoo.fr/
> >
> > but I did not receive more answers, so I stopped trying to investigate
> > the subject.
> >
> > If you want, you can check the produced assembly of both the asm and
> > the builtin for both clang and gcc, and if the builtin is always
> > either better or equivalent, then the asm can be removed. That said, I
> > am not spending more effort there after being ghosted once (cf. above
> > thread).
>
> I don't see any example there of why the __builtin_xxx() versions
> shouldn't be used all the time.
> (The x86-64 asm blocks contain unrelated call instructions and objdump
> wasn't passed -d to show what they were.
> One even has the 'return thunk' pessimisation showing.)

Fair.
My goal was not to point to the assembly code but to this sentence:

    However, for non constant expressions, the kernel's ffs() asm
    version remains better for x86_64 because, contrary to GCC, it
    doesn't emit the CMOV assembly instruction

I should have been more clear. Sorry for that.

But the fact remains: on x86, some of the asm produced more optimized
code than the builtin.

> I actually suspect the asm versions predate the builtins.

This seems true. The __builtins were introduced in:

    generic: Implement generic ffs/fls using __builtin_* functions
    https://git.kernel.org/torvalds/c/048fa2df92c3

when the asm implementation already existed in m68k.

> Does (or can) the outer common header use the __builtin functions
> if no asm version exists?

Yes, this would be extremely easy. You just need to

    #include <asm-generic/bitops/builtin-__ffs.h>
    #include <asm-generic/bitops/builtin-ffs.h>
    #include <asm-generic/bitops/builtin-__fls.h>
    #include <asm-generic/bitops/builtin-fls.h>

and remove all the asm implementations. If you give me your green
light, I can do that change. This is trivial. The only thing I am not
ready to do is to compare the produced assembly code and confirm
whether or not it is better to remove the asm code.

Thoughts?

Yours sincerely,
Vincent Mailhol
On Sun, 28 Jan 2024, Vincent MAILHOL wrote:

> > > The asm is meant to produce better results when the argument is not
> > > a constant expression.

Is that because gcc's implementation has to satisfy requirements that
are excessively stringent for the kernel's purposes? Or is it a
compiler deficiency only affecting certain architectures?

> ... The only thing I am not ready to do is to compare the produced
> assembly code and confirm whether or not it is better to remove asm
> code.

If you do the comparison and find no change, you get to say so in the
commit log, and everyone is happy.
Hi Finn and David,

Sorry for the late feedback, I did not have much time during weekdays.

On Mon. 29 Jan. 2024 at 07:34, Finn Thain <fthain@linux-m68k.org> wrote:
> On Sun, 28 Jan 2024, Vincent MAILHOL wrote:
> > > > The asm is meant to produce better results when the argument is not
> > > > a constant expression.
>
> Is that because gcc's implementation has to satisfy requirements that are
> excessively stringent for the kernel's purposes? Or is it a compiler
> deficiency only affecting certain architectures?

I just guess that the GCC guys followed the Intel datasheet while the
kernel guys like to live a bit more dangerously and rely on some not
so well defined behaviours... But I am really not the person to whom
you should ask.

I just want to optimize the constant folding, and this is the only
purpose of this series. I am absolutely no asm expert. That's also why
I am reluctant to compare the asm outputs.

> > ... The only thing I am not ready to do is to compare the produced
> > assembly code and confirm whether or not it is better to remove asm
> > code.
>
> If you do the comparison and find no change, you get to say so in the
> commit log, and everyone is happy.

Without getting into details, here is a quick comparison of the gcc
and clang generated asm for all the bitops builtins:

https://godbolt.org/z/Yb8nMKnYf

To the extent of my limited knowledge, it looks rather OK for gcc, but
for clang... it seems that this is the default software implementation.

So are we fine with the current patch? Or maybe clang support is not
important for m68k? I do not know, so tell me :)

Yours sincerely,
Vincent Mailhol
On Sun, 4 Feb 2024, Vincent MAILHOL wrote:

> Sorry for the late feedback, I did not have much time during weekdays.
>
> On Mon. 29 Jan. 2024 at 07:34, Finn Thain <fthain@linux-m68k.org> wrote:
> > On Sun, 28 Jan 2024, Vincent MAILHOL wrote:
> > > > > The asm is meant to produce better results when the argument is not
> > > > > a constant expression.
> >
> > Is that because gcc's implementation has to satisfy requirements that are
> > excessively stringent for the kernel's purposes? Or is it a compiler
> > deficiency only affecting certain architectures?
>
> I just guess that the GCC guys followed the Intel datasheet while the
> kernel guys like to live a bit more dangerously and rely on some not
> so well defined behaviours... But I am really not the person to whom
> you should ask.
>
> I just want to optimize the constant folding, and this is the only
> purpose of this series. I am absolutely no asm expert. That's also why
> I am reluctant to compare the asm outputs.

How does replacing asm with a builtin prevent constant folding?

> > > ... The only thing I am not ready to do is to compare the produced
> > > assembly code and confirm whether or not it is better to remove asm
> > > code.
> >
> > If you do the comparison and find no change, you get to say so in the
> > commit log, and everyone is happy.
>
> Without getting into details, here is a quick comparison of the gcc
> and clang generated asm for all the bitops builtins:
>
> https://godbolt.org/z/Yb8nMKnYf
>
> To the extent of my limited knowledge, it looks rather OK for gcc, but
> for clang... it seems that this is the default software implementation.
>
> So are we fine with the current patch? Or maybe clang support is not
> important for m68k? I do not know, so tell me :)

Let's see if I understand.
You are proposing that the kernel source carry an unquantified
optimization, with inherent complexity and maintenance costs, just for
the benefit of users who choose a compiler that doesn't work as well
as the standard compiler. Is that it?

At some point in the future, when clang comes up to scratch with gcc
and the builtin reaches parity with the asm, I wonder if you will then
remove both your optimization and the asm, to eliminate the
aforementioned complexity and maintenance costs. Is there an incentive
for that work?
On Mon. 5 Feb. 2024 at 08:13, Finn Thain <fthain@linux-m68k.org> wrote:
> On Sun, 4 Feb 2024, Vincent MAILHOL wrote:

(...)

> Let's see if I understand.
>
> You are proposing that the kernel source carry an unquantified
> optimization, with inherent complexity and maintenance costs, just for
> the benefit of users who choose a compiler that doesn't work as well
> as the standard compiler. Is that it?

My proposal is quantified, cf. my commit message:

    On an allyesconfig, with GCC 13.2.1, it saves roughly 8 KB.

"Saving roughly 8 KB" is a quantification.

And clang use in the kernel is standardized:

https://www.kernel.org/doc/html/latest/process/changes.html#current-minimal-requirements

GCC may be predominant but, although optional, clang v11+ is
officially supported, and thus my patches should not neglect it.

This is why I am asking whether or not clang support is important for
m68k. If you tell me it is not, then fine, I will remove all the asm
(by the way, the patch is already ready). But if there are even a few
users who care about clang for m68k, then I do not think we should
penalize them, and I would not sign off a change which negatively
impacts some users.

The linux-m68k mailing list should know better than me whether people
care about clang support. So let me reiterate the question from my
previous message: is clang support important for you? I would like a
formal acknowledgement from the linux-m68k mailing list before sending
a patch that may negatively impact some users. Thank you.

> At some point in the future, when clang comes up to scratch with gcc
> and the builtin reaches parity with the asm, I wonder if you will then
> remove both your optimization and the asm, to eliminate the
> aforementioned complexity and maintenance costs. Is there an incentive
> for that work?

I will not follow up the evolution of clang support for m68k builtins.
The goal of this series is to add a test to assert that all
architectures correctly do the constant folding on the bit find
operations (fifth patch of this series). It happens that the m68k and
hexagon architectures are failing that test, so I am fixing this as a
one shot. After this series, I have no plan to do further work on the
m68k and hexagon architectures.

Thank you for your understanding.

Yours sincerely,
Vincent Mailhol
On Mon, 5 Feb 2024, Vincent MAILHOL wrote:

> This is why I am asking whether or not clang support is important for
> m68k. If you tell me it is not, then fine, I will remove all the asm
> (by the way, the patch is already ready). But if there are even a few
> users who care about clang for m68k, then I do not think we should
> penalize them, and I would not sign off a change which negatively
> impacts some users.

If clang support is important then clang's builtins are important. So
why not improve those instead? That would resolve the issue in a
win-win.
On Mon. 5 Feb. 2024 at 18:48, Finn Thain <fthain@linux-m68k.org> wrote:
> On Mon, 5 Feb 2024, Vincent MAILHOL wrote:
>
> > This is why I am asking whether or not clang support is important for
> > m68k. If you tell me it is not, then fine, I will remove all the asm
> > (by the way, the patch is already ready). But if there are even a few
> > users who care about clang for m68k, then I do not think we should
> > penalize them, and I would not sign off a change which negatively
> > impacts some users.
>
> If clang support is important then clang's builtins are important. So
> why not improve those instead? That would resolve the issue in a
> win-win.

I am deeply sorry, but with all due respect, this request is unfair. I
will not fix the compiler.

Let me repeat my question for the third time: are you (or any other
m68k maintainer) ready to acknowledge that we can degrade the
performance for clang m68k users? With that acknowledgement, I will
remove the asm and replace it with the builtins.

Yours sincerely,
Vincent Mailhol
On Mon, 5 Feb 2024, Vincent MAILHOL wrote:

> On Mon. 5 Feb. 2024 at 18:48, Finn Thain <fthain@linux-m68k.org> wrote:
> > On Mon, 5 Feb 2024, Vincent MAILHOL wrote:
> >
> > If clang support is important then clang's builtins are important. So
> > why not improve those instead? That would resolve the issue in a
> > win-win.
>
> I am deeply sorry, but with all due respect, this request is unfair. I
> will not fix the compiler.

Absolutely. It is unfair. Just as your proposal was unfair to
maintainers. That patch would make a compiler deficiency into a
maintenance burden.
GCC and clang provide a set of builtin functions which can be used as
a replacement for ffs(), __ffs(), fls() and __fls().

Using the builtins is a trade-off.

Pros:
- Simpler code, easier to maintain
- Guaranteed constant folding

Cons:
- clang does not produce optimized code. Ref:
  https://godbolt.org/z/Yb8nMKnYf

Despite the cons, use the builtins unconditionally and remove any
existing assembly implementation.

This done, also remove the find_first_zero_bit(), find_next_zero_bit(),
find_first_bit() and find_next_bit() assembly implementations.
Instead, rely on the generic functions from linux/find.h, which will
themselves rely on the builtins we just set up.
Not-signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
As written earlier, I do not want to sign off a patch which can
degrade the performance of m68k clang users. But because it is
already written, there it is.
If someone wants to pick up this, go ahead. Just make sure to remove
any reference to myself. I hereby grant permission for anyone to reuse
that patch, with or without modifications, under the unique condition
that my name gets removed.
---
arch/m68k/include/asm/bitops.h | 215 +--------------------------------
1 file changed, 5 insertions(+), 210 deletions(-)
diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
index 14c64a6f1217..44aac8ced9fc 100644
--- a/arch/m68k/include/asm/bitops.h
+++ b/arch/m68k/include/asm/bitops.h
@@ -340,218 +340,13 @@ static inline bool xor_unlock_is_negative_byte(unsigned long mask,
#endif
}
-/*
- * The true 68020 and more advanced processors support the "bfffo"
- * instruction for finding bits. ColdFire and simple 68000 parts
- * (including CPU32) do not support this. They simply use the generic
- * functions.
- */
-#if defined(CONFIG_CPU_HAS_NO_BITFIELDS)
-#include <asm-generic/bitops/ffz.h>
-#else
-
-static inline int find_first_zero_bit(const unsigned long *vaddr,
- unsigned size)
-{
- const unsigned long *p = vaddr;
- int res = 32;
- unsigned int words;
- unsigned long num;
-
- if (!size)
- return 0;
-
- words = (size + 31) >> 5;
- while (!(num = ~*p++)) {
- if (!--words)
- goto out;
- }
-
- __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
- : "=d" (res) : "d" (num & -num));
- res ^= 31;
-out:
- res += ((long)p - (long)vaddr - 4) * 8;
- return res < size ? res : size;
-}
-#define find_first_zero_bit find_first_zero_bit
-
-static inline int find_next_zero_bit(const unsigned long *vaddr, int size,
- int offset)
-{
- const unsigned long *p = vaddr + (offset >> 5);
- int bit = offset & 31UL, res;
-
- if (offset >= size)
- return size;
-
- if (bit) {
- unsigned long num = ~*p++ & (~0UL << bit);
- offset -= bit;
-
- /* Look for zero in first longword */
- __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
- : "=d" (res) : "d" (num & -num));
- if (res < 32) {
- offset += res ^ 31;
- return offset < size ? offset : size;
- }
- offset += 32;
-
- if (offset >= size)
- return size;
- }
- /* No zero yet, search remaining full bytes for a zero */
- return offset + find_first_zero_bit(p, size - offset);
-}
-#define find_next_zero_bit find_next_zero_bit
-
-static inline int find_first_bit(const unsigned long *vaddr, unsigned size)
-{
- const unsigned long *p = vaddr;
- int res = 32;
- unsigned int words;
- unsigned long num;
-
- if (!size)
- return 0;
-
- words = (size + 31) >> 5;
- while (!(num = *p++)) {
- if (!--words)
- goto out;
- }
-
- __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
- : "=d" (res) : "d" (num & -num));
- res ^= 31;
-out:
- res += ((long)p - (long)vaddr - 4) * 8;
- return res < size ? res : size;
-}
-#define find_first_bit find_first_bit
-
-static inline int find_next_bit(const unsigned long *vaddr, int size,
- int offset)
-{
- const unsigned long *p = vaddr + (offset >> 5);
- int bit = offset & 31UL, res;
-
- if (offset >= size)
- return size;
-
- if (bit) {
- unsigned long num = *p++ & (~0UL << bit);
- offset -= bit;
-
- /* Look for one in first longword */
- __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
- : "=d" (res) : "d" (num & -num));
- if (res < 32) {
- offset += res ^ 31;
- return offset < size ? offset : size;
- }
- offset += 32;
-
- if (offset >= size)
- return size;
- }
- /* No one yet, search remaining full bytes for a one */
- return offset + find_first_bit(p, size - offset);
-}
-#define find_next_bit find_next_bit
-
-/*
- * ffz = Find First Zero in word. Undefined if no zero exists,
- * so code should check against ~0UL first..
- */
-static inline unsigned long ffz(unsigned long word)
-{
- int res;
-
- __asm__ __volatile__ ("bfffo %1{#0,#0},%0"
- : "=d" (res) : "d" (~word & -~word));
- return res ^ 31;
-}
-
-#endif
-
#ifdef __KERNEL__
-#if defined(CONFIG_CPU_HAS_NO_BITFIELDS)
-
-/*
- * The newer ColdFire family members support a "bitrev" instruction
- * and we can use that to implement a fast ffs. Older Coldfire parts,
- * and normal 68000 parts don't have anything special, so we use the
- * generic functions for those.
- */
-#if (defined(__mcfisaaplus__) || defined(__mcfisac__)) && \
- !defined(CONFIG_M68000)
-static inline unsigned long __ffs(unsigned long x)
-{
- __asm__ __volatile__ ("bitrev %0; ff1 %0"
- : "=d" (x)
- : "0" (x));
- return x;
-}
-
-static inline int ffs(int x)
-{
- if (!x)
- return 0;
- return __ffs(x) + 1;
-}
-
-#else
-#include <asm-generic/bitops/ffs.h>
-#include <asm-generic/bitops/__ffs.h>
-#endif
-
-#include <asm-generic/bitops/fls.h>
-#include <asm-generic/bitops/__fls.h>
-
-#else
-
-/*
- * ffs: find first bit set. This is defined the same way as
- * the libc and compiler builtin ffs routines, therefore
- * differs in spirit from the above ffz (man ffs).
- */
-static inline int ffs(int x)
-{
- int cnt;
-
- __asm__ ("bfffo %1{#0:#0},%0"
- : "=d" (cnt)
- : "dm" (x & -x));
- return 32 - cnt;
-}
-
-static inline unsigned long __ffs(unsigned long x)
-{
- return ffs(x) - 1;
-}
-
-/*
- * fls: find last bit set.
- */
-static inline int fls(unsigned int x)
-{
- int cnt;
-
- __asm__ ("bfffo %1{#0,#0},%0"
- : "=d" (cnt)
- : "dm" (x));
- return 32 - cnt;
-}
-
-static inline unsigned long __fls(unsigned long x)
-{
- return fls(x) - 1;
-}
-
-#endif
+#include <asm-generic/bitops/builtin-ffs.h>
+#include <asm-generic/bitops/builtin-__ffs.h>
+#include <asm-generic/bitops/builtin-fls.h>
+#include <asm-generic/bitops/builtin-__fls.h>
+#include <asm-generic/bitops/ffz.h>
/* Simple test-and-set bit locks */
#define test_and_set_bit_lock test_and_set_bit
--
2.43.0
From: Vincent MAILHOL
> Sent: 28 January 2024 13:28
...
> Fair. My goal was not to point to the assembly code but to this sentence:
>
>     However, for non constant expressions, the kernel's ffs() asm
>     version remains better for x86_64 because, contrary to GCC, it
>     doesn't emit the CMOV assembly instruction

The comment in the code is:

 * AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
 * dest reg is undefined if x==0, but their CPU architect says its
 * value is written to set it to the same as before, except that the
 * top 32 bits will be cleared.

I guess gcc isn't willing to act on hearsay!

> I should have been more clear. Sorry for that.
>
> But the fact remains: on x86, some of the asm produced more optimized
> code than the builtin.
>
> > I actually suspect the asm versions predate the builtins.
>
> This seems true. The __builtins were introduced in:
>
>     generic: Implement generic ffs/fls using __builtin_* functions
>     https://git.kernel.org/torvalds/c/048fa2df92c3

I was thinking of compiler versions, not kernel source commits.
You'd need to be doing some serious software archaeology.

> when the asm implementation already existed in m68k.
>
> > Does (or can) the outer common header use the __builtin functions
> > if no asm version exists?
>
> Yes, this would be extremely easy. You just need to
>
>     #include <asm-generic/bitops/builtin-__ffs.h>
>     #include <asm-generic/bitops/builtin-ffs.h>
>     #include <asm-generic/bitops/builtin-__fls.h>
>     #include <asm-generic/bitops/builtin-fls.h>
>
> and remove all the asm implementations. If you give me your green
> light, I can do that change. This is trivial. The only thing I am not
> ready to do is to compare the produced assembly code and confirm
> whether or not it is better to remove the asm code.
>
> Thoughts?

Not for me to say. But the builtins are likely to generate reasonable
code. So unless the asm can be better (like trusting undocumented x86
cpu behaviour) using them is probably best.
For the ones you are changing, it may be best to propose using the
builtins all the time.

	David