x86/vpic: Improve bitops usage

[PATCH] x86/vpic: Improve bitops usage

Posted by Andrew Cooper 9 months, 1 week ago

 * For vpic_get_priority(), introduce a common ror8() helper in plain C.  One
   thing that I can't persuade the compiler to realise is that a non-zero
   value rotated is still non-zero, so use __builtin_clz() to help the
   optimiser out.

 * vpic_ioport_write() can be simplified to just for_each_set_bit(), which
   avoids spilling pending to the stack each loop iteration.  Changing pending
   from unsigned int to uint8_t isn't even strictly necessary given the
   underlying types of vpic->isr and vpic->irr, but done so clarity.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

More detailed analysis:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-53 (-53)
  Function                                     old     new   delta
  vpic_get_highest_priority_irq                134     124     -10
  vpic_intercept_pic_io                       1525    1482     -43

One thing I can't seem to avoid is GCC zero extending the result of ror8()
before passing it to BSF/TZCNT.  Then again, that's very specific to uint8_t,
and x86's preserving behaviour on byte writes.

The space saving in vpic_get_highest_priority_irq(), which has
vpic_get_priority() inlined twice, seems to come from better return-value
handling.  GCC is happy instruction scheduling over __builtin_ctz(), and I
presume it also benefits from range analysis of the result.
---
 xen/arch/x86/hvm/vpic.c  | 21 +++++++--------------
 xen/include/xen/bitops.h |  6 ++++++
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c
index 22020322fbba..30ce367c082d 100644
--- a/xen/arch/x86/hvm/vpic.c
+++ b/xen/arch/x86/hvm/vpic.c
@@ -47,17 +47,16 @@
 #define VPIC_PRIO_NONE 8
 static int vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
 {
-    int prio;
-
     ASSERT(vpic_is_locked(vpic));
 
     if ( mask == 0 )
         return VPIC_PRIO_NONE;
 
-    /* prio = ffs(mask ROR vpic->priority_add); */
-    asm ( "ror %%cl,%b1 ; rep; bsf %1,%0"
-          : "=r" (prio) : "q" ((uint32_t)mask), "c" (vpic->priority_add) );
-    return prio;
+    /*
+     * We use __builtin_ctz() rather than ffs() because the compiler can't
+     * reason that a nonzero mask rotated is still nonzero.
+     */
+    return __builtin_ctz(ror8(mask, vpic->priority_add));
 }
 
 /* Return the PIC's highest priority pending interrupt. Return -1 if none. */
@@ -196,7 +195,7 @@ static void vpic_ioport_write(
     {
         if ( val & 0x10 )
         {
-            unsigned int pending = vpic->isr | (vpic->irr & ~vpic->elcr);
+            uint8_t pending = vpic->isr | (vpic->irr & ~vpic->elcr);
 
             /* ICW1 */
             /* Clear edge-sensing logic. */
@@ -229,15 +228,9 @@ static void vpic_ioport_write(
              * been cleared from IRR or ISR, or else the dpci logic will get
              * out of sync with the state of the interrupt controller.
              */
-            while ( pending )
-            {
-                unsigned int pin = __scanbit(pending, 8);
-
-                ASSERT(pin < 8);
+            for_each_set_bit ( pin, pending )
                 hvm_dpci_eoi(current->domain,
                              hvm_isa_irq_to_gsi((addr >> 7) ? (pin | 8) : pin));
-                __clear_bit(pin, &pending);
-            }
             return;
         }
         else if ( val & 0x08 )
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index 448b2d3e0937..a4d31ec02a7c 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -418,6 +418,12 @@ static inline uint32_t rol32(uint32_t word, unsigned int shift)
     return (word << shift) | (word >> (32 - shift));
 }
 
+/* ror8 - rotate an 8-bit value right */
+static inline uint8_t ror8(uint8_t value, unsigned int shift)
+{
+    return (value >> shift) | (value << (8 - shift));
+}
+
 /*
  * ror32 - rotate a 32-bit value right
  *
-- 
2.39.5

Re: [PATCH] x86/vpic: Improve bitops usage

Posted by Jan Beulich 9 months, 1 week ago

On 02.05.2025 01:30, Andrew Cooper wrote:
>  * For vpic_get_priority(), introduce a common ror8() helper in plain C.  One
>    thing that I can't persuade the compiler to realise is that a non-zero
>    value rotated is still non-zero, so use __builtin_clz() to help the
>    optimiser out.
> 
>  * vpic_ioport_write() can be simplified to just for_each_set_bit(), which
>    avoids spilling pending to the stack each loop iteration.  Changing pending
>    from unsigned int to uint8_t isn't even strictly necessary given the
>    underlying types of vpic->isr and vpic->irr, but done so clarity.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> One thing I can't seem to avoid is GCC zero extending the result of ror8()
> before passing it to BSF/TZCNT.  Then again, that's very specific to uint8_t,
> and x86's preserving behaviour on byte writes.

It can't know that the upper bits are "don't care" here, related to the aspect
above (it not inferring non-zero of the result of the rotate when the input is
non-zero). And I expect it would be too much of a special case to warrant
getting all of this accounted for, just to remove the zero-ext.

For this specific case we might be able to cheat, but I'm unsure that's worth
it either (and I also haven't tried whether it actually has the intended
effect):

    val8 = ror8(mask, vpic->priority_add);
    asm ( "" : "=r" (val32) : "0" (val8) );
    return __builtin_ctz(val32);

Jan

Re: [PATCH] x86/vpic: Improve bitops usage

Posted by Andrew Cooper 9 months, 1 week ago

On 02/05/2025 9:04 am, Jan Beulich wrote:
> On 02.05.2025 01:30, Andrew Cooper wrote:
>>  * For vpic_get_priority(), introduce a common ror8() helper in plain C.  One
>>    thing that I can't persuade the compiler to realise is that a non-zero
>>    value rotated is still non-zero, so use __builtin_clz() to help the
>>    optimiser out.
>>
>>  * vpic_ioport_write() can be simplified to just for_each_set_bit(), which
>>    avoids spilling pending to the stack each loop iteration.  Changing pending
>>    from unsigned int to uint8_t isn't even strictly necessary given the
>>    underlying types of vpic->isr and vpic->irr, but done so clarity.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> One thing I can't seem to avoid is GCC zero extending the result of ror8()
>> before passing it to BSF/TZCNT.  Then again, that's very specific to uint8_t,
>> and x86's preserving behaviour on byte writes.
> It can't know that the upper bits are "don't care" here, related to the aspect
> above (it not inferring non-zero of the result of the rotate when the input is
> non-zero).

Hmm, I suppose so.

> And I expect it would be too much of a special case to warrant
> getting all of this accounted for, just to remove the zero-ext.
>
> For this specific case we might be able to cheat, but I'm unsure that's worth
> it either (and I also haven't tried whether it actually has the intended
> effect):
>
>     val8 = ror8(mask, vpic->priority_add);
>     asm ( "" : "=r" (val32) : "0" (val8) );
>     return __builtin_ctz(val32);

It's one zero-extend.  I was only curious because I was looking in
detail at the code generation, but it's not something I'm losing sleep over.

~Andrew