[v5] x86/asm/bitops: optimize ff{s,z} functions for constant expressions

[PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Vincent Mailhol 3 years, 8 months ago

__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x). Because
__builting_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.

However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remains slightly better than the code produced
by GCC (it produces a useless instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.

** Statistics **

On a allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the asm bsf instruction produces tzcnt when used with
the ret prefix (which explain the use of `grep tzcnt' instead of `grep
bsf' in above benchmark). c.f. [1]

[1] commit e26a44a2d618 ("x86: Use REP BSF unconditionally")
Link: http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 38 ++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6ed979547086..bd49aef87ab6 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -224,13 +224,7 @@ static __always_inline bool variable_test_bit(long nr, volatile const unsigned l
 	 ? constant_test_bit((nr), (addr))	\
 	 : variable_test_bit((nr), (addr)))
 
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long variable___ffs(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -238,13 +232,18 @@ static __always_inline unsigned long __ffs(unsigned long word)
 	return word;
 }
 
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+#define __ffs(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(word) :	\
+	 variable___ffs(word))
+
+static __always_inline unsigned long variable_ffz(unsigned long word)
 {
 	asm("rep; bsf %1,%0"
 		: "=r" (word)
@@ -252,6 +251,17 @@ static __always_inline unsigned long ffz(unsigned long word)
 	return word;
 }
 
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+#define ffz(word)				\
+	(__builtin_constant_p(word) ?		\
+	 (unsigned long)__builtin_ctzl(~word) :	\
+	 variable_ffz(word))
+
 /*
  * __fls: find last set bit in word
  * @word: The word to search
-- 
2.35.1

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Borislav Petkov 3 years, 7 months ago

On Fri, Aug 12, 2022 at 08:44:38PM +0900, Vincent Mailhol wrote:
> __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x)

Are you sure about this?

My gcc documentation says:

"Built-in Function: int __builtin_ctz (unsigned int x)

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."

Note the undefined part.

Also,

__builtin_ctzl(0): 0x40
ffs(0): 0x0

I'm using the kernel ffs() version in a small program which is basically
a wrapper around BSF.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Nick Desaulniers 3 years, 7 months ago

On Tue, Aug 23, 2022 at 9:23 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Aug 12, 2022 at 08:44:38PM +0900, Vincent Mailhol wrote:
> > __ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x)
>
> Are you sure about this?
>
> My gcc documentation says:
>
> "Built-in Function: int __builtin_ctz (unsigned int x)
>
>     Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
>
> Note the undefined part.
>
> Also,
>
> __builtin_ctzl(0): 0x40
> ffs(0): 0x0
>
> I'm using the kernel ffs() version in a small program which is basically
> a wrapper around BSF.

Callers of these need to guard against zero input, as the pre-existing
comment notes:

>> Undefined if no bit exists, so code should check against 0 first.
-- 
Thanks,
~Nick Desaulniers

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Borislav Petkov 3 years, 7 months ago

On Tue, Aug 23, 2022 at 10:12:17AM -0700, Nick Desaulniers wrote:
> Callers of these need to guard against zero input, as the pre-existing
> comment notes:
> 
> >> Undefined if no bit exists, so code should check against 0 first.

This is just silly.

And then there's

 * ffs(value) returns 0 if value is 0 or the position of the first
 * set bit if value is nonzero. The first (least significant) bit
 * is at position 1.
 */
static __always_inline int ffs(int x)

Can we unify the two and move the guard against 0 inside the function
or, like ffs() does, have it return 0 if value is 0?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Vincent MAILHOL 3 years, 7 months ago

On Wed. 24 Aug. 2022 at 02:43, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Aug 23, 2022 at 10:12:17AM -0700, Nick Desaulniers wrote:
> > Callers of these need to guard against zero input, as the pre-existing
> > comment notes:
> >
> > >> Undefined if no bit exists, so code should check against 0 first.

If the fact that __ffs(0) is undefined is a concern, I can add a safety net:

  #define __ffs(word) \
          (__builtin_constant_p(word) ? \
                  (unsigned long)__builtin_ctzl(word) +
BUILD_BUG_ON_ZERO(word): \
                  variable___ffs(word))

It will only catch the constant expression but still better than
nothing (this comment also applies to the other functions undefined
when the argument is zero).

Also, if this aspect was unclear, I can add a comment in the commit
log to explain.

> This is just silly.
>
> And then there's
>
>  * ffs(value) returns 0 if value is 0 or the position of the first
>  * set bit if value is nonzero. The first (least significant) bit
>  * is at position 1.
>  */
> static __always_inline int ffs(int x)
>
> Can we unify the two and move the guard against 0 inside the function
> or, like ffs() does, have it return 0 if value is 0?

There is an index issue. __ffs() starts at 0 but ffs() starts at one.
i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
explains why __fss(0) is undefined.

Yours sincerely,
Vincent Mailhol

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Borislav Petkov 3 years, 7 months ago

On Wed, Aug 24, 2022 at 05:31:20AM +0900, Vincent MAILHOL wrote:
> If the fact that __ffs(0) is undefined is a concern,

So what is of concern is I'm looking at those *ffs things and they look
like a real mess:

 * Undefined if no bit exists, so code should check against 0 first.
 */
static __always_inline unsigned long __ffs(unsigned long word)
{
        asm("rep; bsf %1,%0"

and that's TZCNT.

And nowhere in TZCNT's description does it talk about undefined behavior
- it is all defined.

So I have no clue what that comment is supposed to mean?

Then:

 * ffs - find first set bit in word
 * @x: the word to search
 *
 * This is defined the same way as the libc and compiler builtin ffs
 * routines, therefore differs in spirit from the other bitops.
 *
 * ffs(value) returns 0 if value is 0 or the position of the first
 * set bit if value is nonzero. The first (least significant) bit
 * is at position 1.

while

"Built-in Function: int __builtin_ctz (unsigned int x)

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."

as previously pasted.

So ffs() doesn't have undefined behavior either.

I guess it wants to say, it is undefined in the *respective* libc or
compiler helper implementation. And that should be explained.

> I can add a safety net:

Nah, no need. It seems this "behavior" has been the case a long time so
callers should know better (or have burned themselves properly :)).

> There is an index issue. __ffs() starts at 0 but ffs() starts at one.
> i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
> Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
> explains why __fss(0) is undefined.

I'd love to drop the undefined thing and start counting at 1 while
keeping the 0 case the special one.

But that ship has sailed a long time ago - look at all the __ffs() and
ffs() callers.

Back to your patch: I think the text should be fixed to say that both
ffs() and __ffs()'s kernel implementation doesn't have undefined results
but since it needs to adhere to the libc variants' API, it treats 0
differently. They surely can handle 0 as input.

I.e., I'd like to see a comment there explaining the whole difference
between ffs() and __ffs() so that people are aware.

Btw, pls do

s/variable___ffs/variable__ffs/g

Two underscores are just fine.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Vincent MAILHOL 3 years, 7 months ago

On Wed. 24 Aug 2022 at 17:43, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 24, 2022 at 05:31:20AM +0900, Vincent MAILHOL wrote:
> > If the fact that __ffs(0) is undefined is a concern,
>
> So what is of concern is I'm looking at those *ffs things and they look
> like a real mess:

I agree that the thing is a mess. Especially the naming: adding
underscores when the behaviour is different is misleading. I think
that ctzl() would have been a better name than __ffs().

>  * Undefined if no bit exists, so code should check against 0 first.
>  */
> static __always_inline unsigned long __ffs(unsigned long word)
> {
>         asm("rep; bsf %1,%0"
>
> and that's TZCNT.

Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…

> And nowhere in TZCNT's description does it talk about undefined behavior
> - it is all defined.
>
> So I have no clue what that comment is supposed to mean?

It means that __ffs() is not a x86_64 specific function. Each
architecture is free to provide an optimized implementation and are
free to ignore __ffs(0) because this is undefined.
For ffs(0) to be defined, every architecture would have to produce the
same result, and this is not the case.

> Then:
>
>  * ffs - find first set bit in word
>  * @x: the word to search
>  *
>  * This is defined the same way as the libc and compiler builtin ffs
>  * routines, therefore differs in spirit from the other bitops.
>  *
>  * ffs(value) returns 0 if value is 0 or the position of the first
>  * set bit if value is nonzero. The first (least significant) bit
>  * is at position 1.
>
> while
>
> "Built-in Function: int __builtin_ctz (unsigned int x)
>
>     Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."
>
> as previously pasted.
>
> So ffs() doesn't have undefined behavior either.
>
> I guess it wants to say, it is undefined in the *respective* libc or
> compiler helper implementation. And that should be explained.
>
> > I can add a safety net:
>
> Nah, no need. It seems this "behavior" has been the case a long time so
> callers should know better (or have burned themselves properly :)).
>
> > There is an index issue. __ffs() starts at 0 but ffs() starts at one.
> > i.e.: __ffs(0x01) is 0 but ffs(0x01) is 1.
> > Aside from the zero edge case, ffs(x) equals __ffs(x) + 1. This
> > explains why __fss(0) is undefined.
>
> I'd love to drop the undefined thing and start counting at 1 while
> keeping the 0 case the special one.
>
> But that ship has sailed a long time ago - look at all the __ffs() and
> ffs() callers.

ACK. I do not believe that this is something which can be changed now.
At least, I am not willing to start such a crusade.

> Back to your patch: I think the text should be fixed to say that both
> ffs() and __ffs()'s kernel implementation doesn't have undefined results

NACK. __ffs(0) is an undefined behaviour (c.f. TZCNT instruction for
x86_64 and BSF instruction for x86). Even if x86_64 and x86 had the
same behaviour that would still not be OK as it may fool developers
into believing that __ffs(0) is defined kernel wide and would result
in non portable code.

> but since it needs to adhere to the libc variants' API, it treats 0
> differently. They surely can handle 0 as input.
>
> I.e., I'd like to see a comment there explaining the whole difference
> between ffs() and __ffs() so that people are aware.

This would be helpful but the priority would then be to modify asm-generic:
https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bitops/__ffs.h#L11

Regardless, I do not think that the comment of __ffs() and ffs() is
related to this patch series.

> Btw, pls do
>
> s/variable___ffs/variable__ffs/g
>
> Two underscores are just fine.

OK for me. The rationale was to name it variable_<function_name>()
thus the three underscores. But I will also be happy with two
underscores.


Yours sincerely,
Vincent Mailhol

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Borislav Petkov 3 years, 7 months ago

On Wed, Aug 24, 2022 at 09:10:59PM +0900, Vincent MAILHOL wrote:
> Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…

Not x86 - some old models which do not understand TZCNT, I'm being told.
And I'm being also told, "Intel and AMD disagree on what BSF does when
passed 0". So this is more mess.

> It means that __ffs() is not a x86_64 specific function. Each

No, not that. The comment "Undefined if no bit exists".

On my machine, __ffs(0) - the way it is implemented:

	rep; bsf %1,%0

is well-defined:

"If the input operand is zero, CF is set to 1 and the size (in bits) of
the input operand is written to the destination register. Otherwise, CF
is cleared."

Leading to

__ffs(0): 0x40

i.e., input operand of 64 bits.

So on this particular x86 implementation, TZCNT(0) is well defined.

So I'd like for that "undefined" thing to be expanded upon and
explained. Something along the lines of "the libc/compiler primitives'
*ffs(0) is undefined. Our inline asm helpers adhere to that behavior
even if the result they return for input operand of 0 is very well
defined."

Now, if there are some machines which do not adhere to the current hw
behavior, then they should be ALTERNATIVEd.

Better?

> > Back to your patch: I think the text should be fixed to say that both
> > ffs() and __ffs()'s kernel implementation doesn't have undefined results
> 
> NACK. __ffs(0) is an undefined behaviour (c.f. TZCNT instruction for

NACK, SCHMACK. Read my mail again: "I think the text should be fixed".
The *text* - not __ffs(0) itself. The *text* should be fixed to explain
what undefined means. See above too.

IOW, to start explaining this humongous mess I've scratched the surface
of.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Vincent MAILHOL 3 years, 7 months ago

On Wed. 24 Aug. 2022 at 22:24, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 24, 2022 at 09:10:59PM +0900, Vincent MAILHOL wrote:
> > Not exactly, this is TZCNT for x86_64 but for x86, it will be BSF…
>
> Not x86 - some old models which do not understand TZCNT, I'm being told.

ACK.

> And I'm being also told, "Intel and AMD disagree on what BSF does when
> passed 0". So this is more mess.

ACK.

> > It means that __ffs() is not a x86_64 specific function. Each
>
> No, not that. The comment "Undefined if no bit exists".
>
> On my machine, __ffs(0) - the way it is implemented:
>
>         rep; bsf %1,%0
>
> is well-defined:
>
> "If the input operand is zero, CF is set to 1 and the size (in bits) of
> the input operand is written to the destination register. Otherwise, CF
> is cleared."

It is well defined on *your* machine.

On some other machines, it might be undefined:
"If the content of the source operand is 0, the content of the
destination operand is undefined."
https://www.felixcloutier.com/x86/bsf

> Leading to
>
> __ffs(0): 0x40
>
> i.e., input operand of 64 bits.
>
> So on this particular x86 implementation, TZCNT(0) is well defined.

It is here where I do not follow you. OK that on most of the recent
machines, the compiler will emit a TZCNT and that this instruction is
well defined for zero. But on some older machines, it will emit BSF,
and on a subset of those machines, BSF(0) might be undefined.

> So I'd like for that "undefined" thing to be expanded upon and
> explained. Something along the lines of "the libc/compiler primitives'
> *ffs(0) is undefined. Our inline asm helpers adhere to that behavior
> even if the result they return for input operand of 0 is very well
> defined."
>
> Now, if there are some machines which do not adhere to the current hw
> behavior, then they should be ALTERNATIVEd.
>
> Better?
>
> > > Back to your patch: I think the text should be fixed to say that both
> > > ffs() and __ffs()'s kernel implementation doesn't have undefined results
> >
> > NACK. __ffs(0) is an undefined behaviour (c.f. TZCNT instruction for
>
> NACK, SCHMACK. Read my mail again: "I think the text should be fixed".
> The *text* - not __ffs(0) itself. The *text* should be fixed to explain
> what undefined means. See above too.
>
> IOW, to start explaining this humongous mess I've scratched the surface
> of.

Agree that this is only the surface. But, my patch series is about
constant folding, not about the text of *ffs(). Here, I just *move*
the existing text, I did not modify anything.
Can we agree that this is a separate topic? I do not think I am the
good person to fix that mess (and in all honesty, I am not a domain
expert in this domain and I am afraid I would just make you lose your
time if I had to work on this).


Yours sincerely,
Vincent Mailhol

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Borislav Petkov 3 years, 7 months ago

On Sat, Aug 27, 2022 at 06:32:05AM +0900, Vincent MAILHOL wrote:
> Agree that this is only the surface. But, my patch series is about
> constant folding, not about the text of *ffs(). Here, I just *move*
> the existing text, I did not modify anything.
> Can we agree that this is a separate topic?

Sure we can.

But then you can't start your commit message with:

"__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
is equivalent to (unsigned long)__builtin_ctzl(~x)."

which will bring unenlightened readers like me into the very same mess.

So at least mention that there's a difference between the kernel
implementation using hw insns which are well defined on some machines
and what the glibc API does. So that at least people are aware that
there's something dangerous to be cautious about.

Ok?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Vincent MAILHOL 3 years, 7 months ago

On Wed. 7 Sep 2022 at 13:06, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Aug 27, 2022 at 06:32:05AM +0900, Vincent MAILHOL wrote:
> > Agree that this is only the surface. But, my patch series is about
> > constant folding, not about the text of *ffs(). Here, I just *move*
> > the existing text, I did not modify anything.
> > Can we agree that this is a separate topic?
>
> Sure we can.
>
> But then you can't start your commit message with:
>
> "__ffs(x) is equivalent to (unsigned long)__builtin_ctzl(x) and ffz(x)
> is equivalent to (unsigned long)__builtin_ctzl(~x)."
>
> which will bring unenlightened readers like me into the very same mess.
>
> So at least mention that there's a difference between the kernel
> implementation using hw insns which are well defined on some machines
> and what the glibc API does. So that at least people are aware that
> there's something dangerous to be cautious about.
>
> Ok?

OK.

I rephrased the beginning of the commit message as below:

If x is not 0, __ffs(x) is equivalent to:
  (unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
  (unsigned long)__builtin_ctzl(~x)
Because __builting_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.

Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builting_ctzl() is safe.

Does this solve the issue? If yes, I will prepare the v8 right away.

Yours sincerely,
Vincent Mailhol

Re: [PATCH v5 2/2] x86/asm/bitops: __ffs,ffz: use __builtin_ctzl to evaluate constant expressions

Posted by Borislav Petkov 3 years, 7 months ago

On Wed, Sep 07, 2022 at 02:35:41PM +0900, Vincent MAILHOL wrote:
> I rephrased the beginning of the commit message as below:
> 
> 
> If x is not 0, __ffs(x) is equivalent to:
>   (unsigned long)__builtin_ctzl(x)
> And if x is not ~0UL, ffz(x) is equivalent to:
>   (unsigned long)__builtin_ctzl(~x)
> Because __builting_ctzl() returns an int, a cast to (unsigned long) is
> necessary to avoid potential warnings on implicit casts.
> 
> Concerning the edge cases, __builtin_ctzl(0) is always undefined,
> whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
> the processor. Regardless, for both functions, developers are asked to
> check against 0 or ~0UL so replacing __ffs() or ffz() by
> __builting_ctzl() is safe.
> 
> 
> 
> Does this solve the issue?

Yes, that sounds better.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette