[PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD

[PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD
Posted by Andrew Cooper 2 months ago
One aspect of the old ffs() infrastructure was the use of TZCNT without a
CPUID check.  Given no obvious justification, I dropped it during the cleanup.

It turns out there is a good reason, on all but the most recent AMD CPUs.

Reinstate the use of "blind" TZCNT when it is safe to do so.  This happens
preferentially in loops, where the repeated saving of 5-6 uops is far more
relevant than in a one-off scenario.  Leave an explanation of why.
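
As a sketch of the loop shape in question (illustrative only: the function
name and the per-bit work are placeholders, not a specific Xen call site):

  static unsigned int sum_set_bit_indices(unsigned int v)
  {
      unsigned int total = 0;

      while ( v )                         /* v is proven non-zero here ...   */
      {
          unsigned int bit = ffs(v) - 1;  /* ... so the cheap form is picked */

          total += bit;                   /* stand-in for real per-bit work  */
          v &= v - 1;                     /* clear the lowest set bit        */
      }

      return total;
  }

for_each_set_bit() has exactly this shape, so each iteration benefits.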

It is not safe to perform this transformation for BSR and LZCNT, because they
count from opposite ends of the input operand.  Furthermore, this incorrect
transform was not caught by the selftests.
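
As an illustrative aside (not part of the patch; the helper names below are
made up), the difference between the two mnemonic pairs is:

  /*
   * For non-zero x, TZCNT and BSF return the same value (on pre-BMI1
   * parts TZCNT decodes as REP BSF and simply behaves as BSF), so
   * swapping the mnemonic is safe.
   */
  static unsigned int lowest_set_bit(unsigned int x)    /* x != 0 */
  {
      unsigned int r;

      asm ( "tzcnt %[val], %[res]"        /* same result as "bsf" */
            : [res] "=r" (r)
            : [val] "rm" (x) );

      return r;
  }

  /*
   * BSR returns the index of the highest set bit, whereas LZCNT counts
   * the leading zeros, i.e. yields 31 minus that index.  Blindly
   * swapping the mnemonic therefore changes the result.
   */
  static unsigned int highest_set_bit(unsigned int x)   /* x != 0 */
  {
      unsigned int r;

      asm ( "bsr %[val], %[res]"          /* "lzcnt" would give 31 - r */
            : [res] "=r" (r)
            : [val] "rm" (x) );

      return r;
  }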

Extend the selftests with test_for_each_set_bit_reverse() as that exercises
the "x known non-zero" optimsiation path in x86's helpers.  Adjust the
existing test_for_each_set_bit() to intentionally use an asymmetric input bit
pattern, otherwise "counting from the wrong end" bugs continue to slip by.

No functional change.

Fixes: 989e5f08d33e ("x86/bitops: Improve arch_ffs() in the general case")
Fixes: 5ed26fc0768d ("xen/bitops: Implement ffsl() in common logic")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Use TZCNT mnemonic given the updated toolchain baseline
 * Do not convert LZCNT.
 * Extend selftests.

https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/2004918013

I've kept Jan's R-by as this is a subset of what was previously reviewed, but
I'm posting it for clarity as there are a lot of selftest changes.
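
For illustration only (not part of the patch; the mask value and the
printk() call are arbitrary), the construct added to the selftests reads as:

  unsigned int mask = 0x8021;            /* bits 15, 5 and 0 set */

  for_each_set_bit_reverse ( bit, mask )
      printk("bit %u\n", bit);           /* visits 15, then 5, then 0 */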
---
 xen/arch/x86/include/asm/bitops.h | 13 ++++--
 xen/common/bitops.c               | 78 ++++++++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/bitops.h b/xen/arch/x86/include/asm/bitops.h
index 87eac7782f10..b17d67e8ca61 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -399,9 +399,16 @@ static always_inline unsigned int arch_ffs(unsigned int x)
          *
          * and the optimiser really can work with the knowledge of x being
          * non-zero without knowing its exact value, in which case we don't
-         * need to compensate for BSF's corner cases.  Otherwise...
+         * need to compensate for BSF's corner cases.
+         *
+         * That said, we intentionally use TZCNT on capable hardware when the
+         * behaviour for the 0 case doesn't matter.  On AMD CPUs prior to
+         * Zen4, TZCNT is 1-2 uops while BSF is 6-8 with a latency to match.
+         * Intel CPUs don't suffer this discrepancy.
+         *
+         * Otherwise...
          */
-        asm ( "bsf %[val], %[res]"
+        asm ( "tzcnt %[val], %[res]"
               : [res] "=r" (r)
               : [val] "rm" (x) );
     }
@@ -428,7 +435,7 @@ static always_inline unsigned int arch_ffsl(unsigned long x)
 
     /* See arch_ffs() for safety discussions. */
     if ( __builtin_constant_p(x > 0) && x > 0 )
-        asm ( "bsf %[val], %q[res]"
+        asm ( "tzcnt %[val], %q[res]"
               : [res] "=r" (r)
               : [val] "rm" (x) );
     else
diff --git a/xen/common/bitops.c b/xen/common/bitops.c
index bd17ddef5700..14e6265037f1 100644
--- a/xen/common/bitops.c
+++ b/xen/common/bitops.c
@@ -97,14 +97,14 @@ static void __init test_for_each_set_bit(void)
     if ( ui != ui_res )
         panic("for_each_set_bit(uint) expected %#x, got %#x\n", ui, ui_res);
 
-    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 1);
+    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
     for_each_set_bit ( i, ul )
         ul_res |= 1UL << i;
 
     if ( ul != ul_res )
         panic("for_each_set_bit(ulong) expected %#lx, got %#lx\n", ul, ul_res);
 
-    ull = HIDE(0x8000000180000001ULL);
+    ull = HIDE(0x8000000180000011ULL);
     for_each_set_bit ( i, ull )
         ull_res |= 1ULL << i;
 
@@ -127,6 +127,79 @@ static void __init test_for_each_set_bit(void)
         panic("for_each_set_bit(break) expected 0x1008, got %#x\n", ui_res);
 }
 
+/*
+ * A type-generic fls() which picks the appropriate fls{,l,64}() based on its
+ * argument.
+ */
+#define fls_g(x)                                        \
+    (sizeof(x) <= sizeof(int)      ? fls(x) :           \
+     sizeof(x) <= sizeof(long)     ? flsl(x) :          \
+     sizeof(x) <= sizeof(uint64_t) ? fls64(x) :         \
+     ({ BUILD_ERROR("fls_g() Bad input type"); 0; }))
+
+/*
+ * for_each_set_bit_reverse() - Iterate over all set bits in a scalar value,
+ * from MSB to LSB.
+ *
+ * @iter An iterator name.  Scope is within the loop only.
+ * @val  A scalar value to iterate over.
+ *
+ * A copy of @val is taken internally.
+ */
+#define for_each_set_bit_reverse(iter, val)             \
+    for ( typeof(val) __v = (val); __v; __v = 0 )       \
+        for ( unsigned int (iter);                      \
+              __v && ((iter) = fls_g(__v) - 1, true);   \
+              __clear_bit(iter, &__v) )
+
+/*
+ * Xen doesn't have need of for_each_set_bit_reverse() at present, but the
+ * construct does exercise a case of arch_fls*() not covered anywhere else by
+ * these tests.
+ */
+static void __init test_for_each_set_bit_reverse(void)
+{
+    unsigned int  ui,  ui_res = 0, tmp;
+    unsigned long ul,  ul_res = 0;
+    uint64_t      ull, ull_res = 0;
+
+    ui = HIDE(0x80008001U);
+    for_each_set_bit_reverse ( i, ui )
+        ui_res |= 1U << i;
+
+    if ( ui != ui_res )
+        panic("for_each_set_bit_reverse(uint) expected %#x, got %#x\n", ui, ui_res);
+
+    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
+    for_each_set_bit_reverse ( i, ul )
+        ul_res |= 1UL << i;
+
+    if ( ul != ul_res )
+        panic("for_each_set_bit_reverse(ulong) expected %#lx, got %#lx\n", ul, ul_res);
+
+    ull = HIDE(0x8000000180000011ULL);
+    for_each_set_bit_reverse ( i, ull )
+        ull_res |= 1ULL << i;
+
+    if ( ull != ull_res )
+        panic("for_each_set_bit_reverse(uint64) expected %#"PRIx64", got %#"PRIx64"\n", ull, ull_res);
+
+    /* Check that we can break from the middle of the loop. */
+    ui = HIDE(0x80001008U);
+    tmp = 0;
+    ui_res = 0;
+    for_each_set_bit_reverse ( i, ui )
+    {
+        if ( tmp++ > 1 )
+            break;
+
+        ui_res |= 1U << i;
+    }
+
+    if ( ui_res != 0x80001000U )
+        panic("for_each_set_bit_reverse(break) expected 0x80001000, got %#x\n", ui_res);
+}
+
 static void __init test_multiple_bits_set(void)
 {
     /*
@@ -169,6 +242,7 @@ static void __init __constructor test_bitops(void)
     test_ffs();
     test_fls();
     test_for_each_set_bit();
+    test_for_each_set_bit_reverse();
 
     test_multiple_bits_set();
     test_hweight();
-- 
2.39.5


Re: [PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD
Posted by Jan Beulich 2 months ago
On 26.08.2025 19:41, Andrew Cooper wrote:
> --- a/xen/common/bitops.c
> +++ b/xen/common/bitops.c
> @@ -97,14 +97,14 @@ static void __init test_for_each_set_bit(void)
>      if ( ui != ui_res )
>          panic("for_each_set_bit(uint) expected %#x, got %#x\n", ui, ui_res);
>  
> -    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 1);
> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>      for_each_set_bit ( i, ul )
>          ul_res |= 1UL << i;
>  
>      if ( ul != ul_res )
>          panic("for_each_set_bit(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>  
> -    ull = HIDE(0x8000000180000001ULL);
> +    ull = HIDE(0x8000000180000011ULL);
>      for_each_set_bit ( i, ull )
>          ull_res |= 1ULL << i;

How do these changes make a difference? Apart from ffs() using TZCNT, ...

> @@ -127,6 +127,79 @@ static void __init test_for_each_set_bit(void)
>          panic("for_each_set_bit(break) expected 0x1008, got %#x\n", ui_res);
>  }
>  
> +/*
> + * A type-generic fls() which picks the appropriate fls{,l,64}() based on its
> + * argument.
> + */
> +#define fls_g(x)                                        \
> +    (sizeof(x) <= sizeof(int)      ? fls(x) :           \
> +     sizeof(x) <= sizeof(long)     ? flsl(x) :          \
> +     sizeof(x) <= sizeof(uint64_t) ? fls64(x) :         \
> +     ({ BUILD_ERROR("fls_g() Bad input type"); 0; }))
> +
> +/*
> + * for_each_set_bit_reverse() - Iterate over all set bits in a scalar value,
> + * from MSB to LSB.
> + *
> + * @iter An iterator name.  Scope is within the loop only.
> + * @val  A scalar value to iterate over.
> + *
> + * A copy of @val is taken internally.
> + */
> +#define for_each_set_bit_reverse(iter, val)             \
> +    for ( typeof(val) __v = (val); __v; __v = 0 )       \
> +        for ( unsigned int (iter);                      \
> +              __v && ((iter) = fls_g(__v) - 1, true);   \
> +              __clear_bit(iter, &__v) )
> +
> +/*
> + * Xen doesn't have need of for_each_set_bit_reverse() at present, but the
> + * construct does exercise a case of arch_fls*() not covered anywhere else by
> + * these tests.
> + */
> +static void __init test_for_each_set_bit_reverse(void)
> +{
> +    unsigned int  ui,  ui_res = 0, tmp;
> +    unsigned long ul,  ul_res = 0;
> +    uint64_t      ull, ull_res = 0;
> +
> +    ui = HIDE(0x80008001U);
> +    for_each_set_bit_reverse ( i, ui )
> +        ui_res |= 1U << i;
> +
> +    if ( ui != ui_res )
> +        panic("for_each_set_bit_reverse(uint) expected %#x, got %#x\n", ui, ui_res);
> +
> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
> +    for_each_set_bit_reverse ( i, ul )
> +        ul_res |= 1UL << i;
> +
> +    if ( ul != ul_res )
> +        panic("for_each_set_bit_reverse(ulong) expected %#lx, got %#lx\n", ul, ul_res);
> +
> +    ull = HIDE(0x8000000180000011ULL);
> +    for_each_set_bit_reverse ( i, ull )
> +        ull_res |= 1ULL << i;

... even here the need for the extra setting of bit 4 remains unclear to
me: The thing that was missing was the testing of the reverse for-each.
You mention the need for an asymmetric input in the description, but isn't
that covered already by the first test using 0x80008001U?

That said, I certainly don't mind the change, it just isn't quite clear to
me whether I'm missing something crucial.

Jan
Re: [PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD
Posted by Andrew Cooper 1 month, 4 weeks ago
On 27/08/2025 8:52 am, Jan Beulich wrote:
> On 26.08.2025 19:41, Andrew Cooper wrote:
>> --- a/xen/common/bitops.c
>> +++ b/xen/common/bitops.c
>> @@ -97,14 +97,14 @@ static void __init test_for_each_set_bit(void)
>>      if ( ui != ui_res )
>>          panic("for_each_set_bit(uint) expected %#x, got %#x\n", ui, ui_res);
>>  
>> -    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 1);
>> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>>      for_each_set_bit ( i, ul )
>>          ul_res |= 1UL << i;
>>  
>>      if ( ul != ul_res )
>>          panic("for_each_set_bit(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>>  
>> -    ull = HIDE(0x8000000180000001ULL);
>> +    ull = HIDE(0x8000000180000011ULL);
>>      for_each_set_bit ( i, ull )
>>          ull_res |= 1ULL << i;
> How do these changes make a difference? Apart from ffs() using TZCNT, ...
>
>> @@ -127,6 +127,79 @@ static void __init test_for_each_set_bit(void)
>>          panic("for_each_set_bit(break) expected 0x1008, got %#x\n", ui_res);
>>  }
>>  
>> +/*
>> + * A type-generic fls() which picks the appropriate fls{,l,64}() based on its
>> + * argument.
>> + */
>> +#define fls_g(x)                                        \
>> +    (sizeof(x) <= sizeof(int)      ? fls(x) :           \
>> +     sizeof(x) <= sizeof(long)     ? flsl(x) :          \
>> +     sizeof(x) <= sizeof(uint64_t) ? fls64(x) :         \
>> +     ({ BUILD_ERROR("fls_g() Bad input type"); 0; }))
>> +
>> +/*
>> + * for_each_set_bit_reverse() - Iterate over all set bits in a scalar value,
>> + * from MSB to LSB.
>> + *
>> + * @iter An iterator name.  Scope is within the loop only.
>> + * @val  A scalar value to iterate over.
>> + *
>> + * A copy of @val is taken internally.
>> + */
>> +#define for_each_set_bit_reverse(iter, val)             \
>> +    for ( typeof(val) __v = (val); __v; __v = 0 )       \
>> +        for ( unsigned int (iter);                      \
>> +              __v && ((iter) = fls_g(__v) - 1, true);   \
>> +              __clear_bit(iter, &__v) )
>> +
>> +/*
>> + * Xen doesn't have need of for_each_set_bit_reverse() at present, but the
>> + * construct does exercise a case of arch_fls*() not covered anywhere else by
>> + * these tests.
>> + */
>> +static void __init test_for_each_set_bit_reverse(void)
>> +{
>> +    unsigned int  ui,  ui_res = 0, tmp;
>> +    unsigned long ul,  ul_res = 0;
>> +    uint64_t      ull, ull_res = 0;
>> +
>> +    ui = HIDE(0x80008001U);
>> +    for_each_set_bit_reverse ( i, ui )
>> +        ui_res |= 1U << i;
>> +
>> +    if ( ui != ui_res )
>> +        panic("for_each_set_bit_reverse(uint) expected %#x, got %#x\n", ui, ui_res);
>> +
>> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>> +    for_each_set_bit_reverse ( i, ul )
>> +        ul_res |= 1UL << i;
>> +
>> +    if ( ul != ul_res )
>> +        panic("for_each_set_bit_reverse(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>> +
>> +    ull = HIDE(0x8000000180000011ULL);
>> +    for_each_set_bit_reverse ( i, ull )
>> +        ull_res |= 1ULL << i;
> ... even here the need for the extra setting of bit 4 remains unclear to
> me: The thing that was missing was the testing of the reverse for-each.
> You mention the need for an asymmetric input in the description, but isn't
> that covered already by the first test using 0x80008001U?

The first test covers {arch_,}f[fl]s() only.  It happens to be safe
against count-from-the-wrong-end bugs, but that was by chance.

The second test covers {arch_,}f[fl]sl() only.  They are unsafe WRT
symmetry, and disjoint (coverage wise) from the first test.

The third test, while the same as the second test on x86, isn't the same
on arm32.
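
To illustrate that last point, a sketch of a typical generic fls64() (the
name and shape are illustrative, not necessarily Xen's exact implementation):

  static inline unsigned int my_fls64(uint64_t x)
  {
  #if BITS_PER_LONG == 64
      return flsl(x);                  /* 64-bit: same path as the ulong test */
  #else
      uint32_t h = x >> 32;

      return h ? fls(h) + 32 : fls((uint32_t)x);  /* 32-bit: distinct paths */
  #endif
  }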


Just because one test happens to be safe (symmetry wise) and passes,
doesn't mean the other variants are tested.

~Andrew

Re: [PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD
Posted by Jan Beulich 1 month, 4 weeks ago
On 01.09.2025 16:21, Andrew Cooper wrote:
> On 27/08/2025 8:52 am, Jan Beulich wrote:
>> On 26.08.2025 19:41, Andrew Cooper wrote:
>>> --- a/xen/common/bitops.c
>>> +++ b/xen/common/bitops.c
>>> @@ -97,14 +97,14 @@ static void __init test_for_each_set_bit(void)
>>>      if ( ui != ui_res )
>>>          panic("for_each_set_bit(uint) expected %#x, got %#x\n", ui, ui_res);
>>>  
>>> -    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 1);
>>> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>>>      for_each_set_bit ( i, ul )
>>>          ul_res |= 1UL << i;
>>>  
>>>      if ( ul != ul_res )
>>>          panic("for_each_set_bit(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>>>  
>>> -    ull = HIDE(0x8000000180000001ULL);
>>> +    ull = HIDE(0x8000000180000011ULL);
>>>      for_each_set_bit ( i, ull )
>>>          ull_res |= 1ULL << i;
>> How do these changes make a difference? Apart from ffs() using TZCNT, ...
>>
>>> @@ -127,6 +127,79 @@ static void __init test_for_each_set_bit(void)
>>>          panic("for_each_set_bit(break) expected 0x1008, got %#x\n", ui_res);
>>>  }
>>>  
>>> +/*
>>> + * A type-generic fls() which picks the appropriate fls{,l,64}() based on its
>>> + * argument.
>>> + */
>>> +#define fls_g(x)                                        \
>>> +    (sizeof(x) <= sizeof(int)      ? fls(x) :           \
>>> +     sizeof(x) <= sizeof(long)     ? flsl(x) :          \
>>> +     sizeof(x) <= sizeof(uint64_t) ? fls64(x) :         \
>>> +     ({ BUILD_ERROR("fls_g() Bad input type"); 0; }))
>>> +
>>> +/*
>>> + * for_each_set_bit_reverse() - Iterate over all set bits in a scalar value,
>>> + * from MSB to LSB.
>>> + *
>>> + * @iter An iterator name.  Scope is within the loop only.
>>> + * @val  A scalar value to iterate over.
>>> + *
>>> + * A copy of @val is taken internally.
>>> + */
>>> +#define for_each_set_bit_reverse(iter, val)             \
>>> +    for ( typeof(val) __v = (val); __v; __v = 0 )       \
>>> +        for ( unsigned int (iter);                      \
>>> +              __v && ((iter) = fls_g(__v) - 1, true);   \
>>> +              __clear_bit(iter, &__v) )
>>> +
>>> +/*
>>> + * Xen doesn't have need of for_each_set_bit_reverse() at present, but the
>>> + * construct does exercise a case of arch_fls*() not covered anywhere else by
>>> + * these tests.
>>> + */
>>> +static void __init test_for_each_set_bit_reverse(void)
>>> +{
>>> +    unsigned int  ui,  ui_res = 0, tmp;
>>> +    unsigned long ul,  ul_res = 0;
>>> +    uint64_t      ull, ull_res = 0;
>>> +
>>> +    ui = HIDE(0x80008001U);
>>> +    for_each_set_bit_reverse ( i, ui )
>>> +        ui_res |= 1U << i;
>>> +
>>> +    if ( ui != ui_res )
>>> +        panic("for_each_set_bit_reverse(uint) expected %#x, got %#x\n", ui, ui_res);
>>> +
>>> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>>> +    for_each_set_bit_reverse ( i, ul )
>>> +        ul_res |= 1UL << i;
>>> +
>>> +    if ( ul != ul_res )
>>> +        panic("for_each_set_bit_reverse(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>>> +
>>> +    ull = HIDE(0x8000000180000011ULL);
>>> +    for_each_set_bit_reverse ( i, ull )
>>> +        ull_res |= 1ULL << i;
>> ... even here the need for the extra setting of bit 4 remains unclear to
>> me: The thing that was missing was the testing of the reverse for-each.
>> You mention the need for an asymmetric input in the description, but isn't
>> that covered already by the first test using 0x80008001U?
> 
> The first test covers {arch_,}f[fl]s() only.  It happens to be safe
> against count-from-the-wrong-end bugs, but that was by chance.
> 
> The second test covers {arch_,}f[fl]sl() only.  They are unsafe WRT
> symmetry, and disjoint (coverage wise) from the first test.
> 
> The third test, while the same as the second test on x86, isn't the same
> on arm32.
> 
> 
> Just because one test happens to be safe (symmetry wise) and passes,
> doesn't mean the other variants are tested.

Hmm, okay, it is of course in principle possible that one flavor is screwed
while the other isn't.

Acked-by: Jan Beulich <jbeulich@suse.com>

Jan

Re: [PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD
Posted by Andrew Cooper 1 month, 4 weeks ago
On 01/09/2025 3:26 pm, Jan Beulich wrote:
> On 01.09.2025 16:21, Andrew Cooper wrote:
>> On 27/08/2025 8:52 am, Jan Beulich wrote:
>>> On 26.08.2025 19:41, Andrew Cooper wrote:
>>>> --- a/xen/common/bitops.c
>>>> +++ b/xen/common/bitops.c
>>>> @@ -97,14 +97,14 @@ static void __init test_for_each_set_bit(void)
>>>>      if ( ui != ui_res )
>>>>          panic("for_each_set_bit(uint) expected %#x, got %#x\n", ui, ui_res);
>>>>  
>>>> -    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 1);
>>>> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>>>>      for_each_set_bit ( i, ul )
>>>>          ul_res |= 1UL << i;
>>>>  
>>>>      if ( ul != ul_res )
>>>>          panic("for_each_set_bit(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>>>>  
>>>> -    ull = HIDE(0x8000000180000001ULL);
>>>> +    ull = HIDE(0x8000000180000011ULL);
>>>>      for_each_set_bit ( i, ull )
>>>>          ull_res |= 1ULL << i;
>>> How do these changes make a difference? Apart from ffs() using TZCNT, ...
>>>
>>>> @@ -127,6 +127,79 @@ static void __init test_for_each_set_bit(void)
>>>>          panic("for_each_set_bit(break) expected 0x1008, got %#x\n", ui_res);
>>>>  }
>>>>  
>>>> +/*
>>>> + * A type-generic fls() which picks the appropriate fls{,l,64}() based on its
>>>> + * argument.
>>>> + */
>>>> +#define fls_g(x)                                        \
>>>> +    (sizeof(x) <= sizeof(int)      ? fls(x) :           \
>>>> +     sizeof(x) <= sizeof(long)     ? flsl(x) :          \
>>>> +     sizeof(x) <= sizeof(uint64_t) ? fls64(x) :         \
>>>> +     ({ BUILD_ERROR("fls_g() Bad input type"); 0; }))
>>>> +
>>>> +/*
>>>> + * for_each_set_bit_reverse() - Iterate over all set bits in a scalar value,
>>>> + * from MSB to LSB.
>>>> + *
>>>> + * @iter An iterator name.  Scope is within the loop only.
>>>> + * @val  A scalar value to iterate over.
>>>> + *
>>>> + * A copy of @val is taken internally.
>>>> + */
>>>> +#define for_each_set_bit_reverse(iter, val)             \
>>>> +    for ( typeof(val) __v = (val); __v; __v = 0 )       \
>>>> +        for ( unsigned int (iter);                      \
>>>> +              __v && ((iter) = fls_g(__v) - 1, true);   \
>>>> +              __clear_bit(iter, &__v) )
>>>> +
>>>> +/*
>>>> + * Xen doesn't have need of for_each_set_bit_reverse() at present, but the
>>>> + * construct does exercise a case of arch_fls*() not covered anywhere else by
>>>> + * these tests.
>>>> + */
>>>> +static void __init test_for_each_set_bit_reverse(void)
>>>> +{
>>>> +    unsigned int  ui,  ui_res = 0, tmp;
>>>> +    unsigned long ul,  ul_res = 0;
>>>> +    uint64_t      ull, ull_res = 0;
>>>> +
>>>> +    ui = HIDE(0x80008001U);
>>>> +    for_each_set_bit_reverse ( i, ui )
>>>> +        ui_res |= 1U << i;
>>>> +
>>>> +    if ( ui != ui_res )
>>>> +        panic("for_each_set_bit_reverse(uint) expected %#x, got %#x\n", ui, ui_res);
>>>> +
>>>> +    ul = HIDE(1UL << (BITS_PER_LONG - 1) | 0x11);
>>>> +    for_each_set_bit_reverse ( i, ul )
>>>> +        ul_res |= 1UL << i;
>>>> +
>>>> +    if ( ul != ul_res )
>>>> +        panic("for_each_set_bit_reverse(ulong) expected %#lx, got %#lx\n", ul, ul_res);
>>>> +
>>>> +    ull = HIDE(0x8000000180000011ULL);
>>>> +    for_each_set_bit_reverse ( i, ull )
>>>> +        ull_res |= 1ULL << i;
>>> ... even here the need for the extra setting of bit 4 remains unclear to
>>> me: The thing that was missing was the testing of the reverse for-each.
>>> You mention the need for an asymmetric input in the description, but isn't
>>> that covered already by the first test using 0x80008001U?
>> The first test covers {arch_,}f[fl]s() only.  It happens to be safe
>> against count-from-the-wrong-end bugs, but that was by chance.
>>
>> The second test covers {arch_,}f[fl]sl() only.  They are unsafe WRT
>> symmetry, and disjoint (coverage wise) from the first test.
>>
>> The third test, while the same as the second test on x86, isn't the same
>> on arm32.
>>
>>
>> Just because one test happens to be safe (symmetry wise) and passes,
>> doesn't mean the other variants are tested.
> Hmm, okay, it is of course in principle possible that one flavor is screwed
> while the other isn't.
>
> Acked-by: Jan Beulich <jbeulich@suse.com>

Thanks, but I now have both an R-by and an A-by from you on this patch.
Which would you prefer?

~Andrew

Re: [PATCH v2] x86/bitops: Optimise arch_ffs{,l}() some more on AMD
Posted by Jan Beulich 1 month, 4 weeks ago
On 01.09.2025 16:31, Andrew Cooper wrote:
> On 01/09/2025 3:26 pm, Jan Beulich wrote:
>> Hmm, okay, it is of course in principle possible that one flavor is screwed
>> while the other isn't.
>>
>> Acked-by: Jan Beulich <jbeulich@suse.com>
> 
> Thanks, but I now have both R-by and A-by you on this patch.  Which
> would you prefer?

Oh, sorry, I checked whether I sent an ack to v2, not paying attention to v2
already having a tag. Please keep the R-b.

Jan