[v1] xen/bitops: hweight() cleanup/improvements

[PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()

Posted by Andrew Cooper 1 year, 5 months ago

Using hweight() is an especially expensive way of determining simply if
multiple bits are set in a value.  Worse, 4 of the 10 hweight() calls in Xen
are of this form.

Switch to the new multiple_bits_set() helper.  This is far more efficient than
the longhand hweight() algorithm and, owing to its simplicity, likely more
efficient than even a dedicated instruction on a superscalar processor.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>

On x86, the code reduction speaks for itself:

  add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-240 (-240)
  Function                                     old     new   delta
  vlapic_ipi                                   722     650     -72
  numa_emulation                               577     497     -80
  do_xenpmu_op                                1665    1577     -88

That's an aweful lot of wasted calculation for the same answer.

I can't find a way of enabling CONFIG_NUMA on ARM (yet?) so right now there's
no change in any other architecture.  However, a synthetic helper shows the
following on ARM32:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-128 (-128)
  Function                                     old     new   delta
  test_mbs                                     176      48    -128

and on ARM64:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
  Function                                     old     new   delta
  test_mbs                                      60      16     -44

PPC64 has POPCNTD in a default build of Xen and by chance both algorithms
compile to the same number of instructions.

RISC-V doesn't have hweight() wired up yet at all.
---
 xen/arch/x86/cpu/vpmu.c   |  2 +-
 xen/arch/x86/hvm/vlapic.c | 10 ++++++----
 xen/common/numa.c         |  2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/cpu/vpmu.c b/xen/arch/x86/cpu/vpmu.c
index b2ba9994129b..a5bb1689c7d5 100644
--- a/xen/arch/x86/cpu/vpmu.c
+++ b/xen/arch/x86/cpu/vpmu.c
@@ -673,7 +673,7 @@ long do_xenpmu_op(
     {
         if ( (pmu_params.val &
               ~(XENPMU_MODE_SELF | XENPMU_MODE_HV | XENPMU_MODE_ALL)) ||
-             (hweight64(pmu_params.val) > 1) )
+             multiple_bits_set(pmu_params.val) )
             return -EINVAL;
 
         /* 32-bit dom0 can only sample itself. */
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 2ec95942713e..4a3e21a65f9d 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -467,12 +467,14 @@ static bool is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand,
         return short_hand != APIC_DEST_SELF;
 
     if ( vlapic_x2apic_mode(vlapic) )
-        return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffffU;
+        return dest_mode ? multiple_bits_set((uint16_t)dest)
+                         : dest == 0xffffffffU;
 
     if ( dest_mode )
-        return hweight8(dest &
-                        GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
-                                                            APIC_DFR))) > 1;
+    {
+        dest &= GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic, APIC_DFR));
+        return multiple_bits_set((uint8_t)dest);
+    }
 
     return dest == 0xff;
 }
diff --git a/xen/common/numa.c b/xen/common/numa.c
index 28a09766fabc..ce3991929ce5 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -546,7 +546,7 @@ static int __init numa_emulation(unsigned long start_pfn,
     uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
 
     /* Kludge needed for the hash function */
-    if ( hweight64(sz) > 1 )
+    if ( multiple_bits_set(sz) )
     {
         uint64_t x = 1;
 
-- 
2.39.2

Re: [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()

Posted by Jan Beulich 1 year, 5 months ago

On 23.08.2024 01:06, Andrew Cooper wrote:
> Using hweight() is an especially expensive way of determining simply if
> multiple bits are set in a value.  Worse, 4 of the 10 hweight() calls in Xen
> are of this form.
> 
> Switch to the new multiple_bits_set() helper.  This is far more efficient than
> the longhand hweight() algorithm and, owing to its simplicity, likely more
> efficient than even a dedicated instruction on a superscalar processor.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Just to mention it: With the title being sufficiently generic, I half
expected the similar {bitmap,cpumask}_weight() to also be taken care of.
Yet certainly those (and maybe also their ... == 1 forms) can be covered
later.

Jan

Re: [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()

Posted by Andrew Cooper 1 year, 5 months ago

On 26/08/2024 11:37 am, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> Using hweight() is an especially expensive way of determining simply if
>> multiple bits are set in a value.  Worse, 4 of the 10 hweight() calls in Xen
>> are of this form.
>>
>> Switch to the new multiple_bits_set() helper.  This is far more efficient than
>> the longhand hweight() algorithm and, owing to its simplicity, likely more
>> efficient than even a dedicated instruction on a superscalar processor.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
> Just to mention it: With the title being sufficiently generic, I half
> expected the similar {bitmap,cpumask}_weight() to also be taken care of.
> Yet certainly those (and maybe also their ... == 1 forms) can be covered
> later.

They are on my radar, but are not as easy to transform.

We'd need a new bitmap API to optimise this case; looking at the
callers, it is definitely worth optimising.  There are also an awful lot
of cpumask_weight() == 1 which want converting, albeit slightly differently.

That's a full series of work in and of itself.

~Andrew

[PATCH 1/9] xen/bitops: Reinstate the please tidy message
[PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper
[PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()
[PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}()
[PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl()
[PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl()
[PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
[PATCH 8/9] xen/bitops: Implement hweight32() in terms of hweightl()
[PATCH 9/9] x86/bitops: Use the POPCNT instruction when available