[PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case

Andrew Cooper posted 1 patch 4 months, 2 weeks ago
Posted by Andrew Cooper 4 months, 2 weeks ago
When livepatching is enabled, this function is used all the time.  Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).

This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
the optimiser has an easier time too.  Bloat-o-meter reports:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
  Function                                     old     new   delta
  check_for_livepatch_work.cold               1201    1183     -18
  check_for_livepatch_work                    1021     982     -39

which isn't too shabby for no logical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

I'm still a little disappointed with the code generation.  GCC still chooses
to set up the full stack frame (6 regs, +3 more slots) intermixed with the
per-cpu calculations.
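
The check by itself is just a read of the per-CPU boolean; as a sketch
(the wrapper name here is invented for the experiment):

  bool work_to_do_check(void)
  {
      /* One per-cpu offset lookup, then a single byte load. */
      return per_cpu(work_to_do, smp_processor_id());
  }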

In isolation, GCC can check the boolean without creating a stack frame:

  <work_to_to>:
    48 89 e2                mov    %rsp,%rdx
    48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
    48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
    8b 4a c1                mov    -0x3f(%rdx),%ecx
    48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
    48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
    0f b6 04 02             movzbl (%rdx,%rax,1),%eax
    c3                      retq

but I can't find a way to convince GCC that it would be worth not setting up a
stack frame in the common case, and having a few extra mov reg/reg's later
in the uncommon case.

I haven't tried manually splitting the function into a check() and a do()
function.  Views on whether that might be acceptable?  At a guess, do() would
need to be a static noinline to avoid it turning back into what it currently
is.
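
i.e. roughly this shape (untested sketch; do_livepatch_work() being the
hypothetical do() half):

  static void noinline do_livepatch_work(void)
  {
      /* ... the current slow path, moved wholesale ... */
  }

  void check_for_livepatch_work(void)
  {
      unsigned int cpu = smp_processor_id();

      /* Fast path: no work to do. */
      if ( likely(!per_cpu(work_to_do, cpu)) )
          return;

      do_livepatch_work();
  }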
---
 xen/common/livepatch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..b6275339f663 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1706,15 +1706,15 @@ void check_for_livepatch_work(void)
     s_time_t timeout;
     unsigned long flags;
 
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
     /* Only do any work when invoked in truly idle state. */
     if ( system_state != SYS_STATE_active ||
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )

base-commit: 49818cde637b5ec20383e46b71f93b2e7d867686
-- 
2.30.2


Re: [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
Posted by Ross Lagerwall 4 months ago
On Fri, Dec 22, 2023 at 10:01 PM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
>
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
>
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
>
> which isn't too shabby for no logical change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Re: [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
Posted by Jan Beulich 4 months ago
On 22.12.2023 23:00, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
> 
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
> 
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
> 
> which isn't too shabby for no logical change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> I'm still a little disappointed with the code generation.  GCC still chooses
> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
> per-cpu calculations.
> 
> In isolation, GCC can check the boolean without creating a stack frame:
> 
>   <work_to_to>:
>     48 89 e2                mov    %rsp,%rdx
>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>     c3                      retq
> 
> but I can't find a way to convince GCC that it would be worth not setting up a
> stack frame in the common case, and having a few extra mov reg/reg's later
> in the uncommon case.
> 
> I haven't tried manually splitting the function into a check() and a do()
> function.  Views on whether that might be acceptable?  At a guess, do() would
> need to be a static noinline to avoid it turning back into what it currently
> is.

Or maybe move the fast-path check into an inline function, which calls the
(renamed) out-of-line one only when there is work to do? Downside
would be that the per-CPU work_to_do variable then couldn't be static anymore.
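
Something like this, say (untested sketch; the variable would need a
DECLARE_PER_CPU() in a header, and the names here are illustrative):

  /* livepatch.h */
  DECLARE_PER_CPU(bool, livepatch_work_to_do);
  void do_livepatch_work(void);

  static inline void check_for_livepatch_work(void)
  {
      /* Fast path: no work to do. */
      if ( likely(!per_cpu(livepatch_work_to_do, smp_processor_id())) )
          return;

      do_livepatch_work();
  }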

Jan
Re: [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
Posted by Andrew Cooper 4 months ago
On 04/01/2024 1:24 pm, Jan Beulich wrote:
> On 22.12.2023 23:00, Andrew Cooper wrote:
>> When livepatching is enabled, this function is used all the time.  Really do
>> check the fastpath first, and annotate it likely() as this is the right answer
>> 100% of the time (to many significant figures).
>>
>> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
>> the optimiser has an easier time too.  Bloat-o-meter reports:
>>
>>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>>   Function                                     old     new   delta
>>   check_for_livepatch_work.cold               1201    1183     -18
>>   check_for_livepatch_work                    1021     982     -39
>>
>> which isn't too shabby for no logical change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> I'm still a little disappointed with the code generation.  GCC still chooses
>> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
>> per-cpu calculations.
>>
>> In isolation, GCC can check the boolean without creating a stack frame:
>>
>>   <work_to_to>:
>>     48 89 e2                mov    %rsp,%rdx
>>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
>>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
>>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>>     c3                      retq
>>
>> but I can't find a way to convince GCC that it would be worth not setting up a
>> stack frame in the common case, and having a few extra mov reg/reg's later
>> in the uncommon case.
>>
>> I haven't tried manually splitting the function into a check() and a do()
>> function.  Views on whether that might be acceptable?  At a guess, do() would
>> need to be a static noinline to avoid it turning back into what it currently
>> is.
> Or maybe move the fast-path check into an inline function, which calls the
> (renamed) out-of-line one only when the fast-path check passes? Downside
> would be that the per-CPU work_to_do variable then couldn't be static anymore.

We can't do that, unfortunately.  It's called from assembly, in
reset_stack_and_*().

But, I've had another idea.  I think attribute cold can help here, as it
will move the majority of the implementation into a separate section.
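
Roughly (untested; the cold attribute on the slow half, instead of or in
addition to noinline):

  static void __attribute__((cold)) do_livepatch_work(void)
  {
      /* Slow path; GCC emits cold functions into a separate
       * unlikely-executed text section. */
  }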

~Andrew

[PATCH v1-alt] xen/livepatch: Make check_for_livepatch_work() faster in the common case
Posted by Andrew Cooper 3 months, 3 weeks ago
When livepatching is enabled, this function is used all the time.  Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).  This cuts out 3 pointer
dereferences in the "nothing to do path".

However, GCC still needs some persuading not to set up the full stack
frame (6 spilled registers, 3 slots of locals) even on the fastpath.

Create a new check_for_livepatch_work() with the fastpath only, and make the
"new" do_livepatch_work() noinline.  This causes the fastpath to need no stack
frame, making it faster still.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

v1-alt:
 * Manually split the functions.

Experimenting with __attribute__((cold)) was disappointing.  Vs this patch, it
creates an extra check_for_livepatch_work.cold function (and section) which is
just `jmp do_livepatch_work`.
---
 xen/common/livepatch.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..2c4b84382798 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1693,7 +1693,7 @@ static int livepatch_spin(atomic_t *counter, s_time_t timeout,
  * The main function which manages the work of quiescing the system and
  * patching code.
  */
-void check_for_livepatch_work(void)
+static void noinline do_livepatch_work(void)
 {
 #define ACTION(x) [LIVEPATCH_ACTION_##x] = #x
     static const char *const names[] = {
@@ -1711,10 +1711,6 @@ void check_for_livepatch_work(void)
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )
@@ -1864,6 +1860,17 @@ void check_for_livepatch_work(void)
     }
 }
 
+void check_for_livepatch_work(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
+    do_livepatch_work();
+}
+
 /*
  * Only allow dependent payload is applied on top of the correct
  * build-id.
-- 
2.30.2


Re: [PATCH v1-alt] xen/livepatch: Make check_for_livepatch_work() faster in the common case
Posted by Jan Beulich 3 months, 3 weeks ago
On 11.01.2024 21:11, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).  This cuts out 3 pointer
> dereferences in the "nothing to do path".
> 
> However, GCC still needs some persuading not to set up the full stack
> frame (6 spilled registers, 3 slots of locals) even on the fastpath.
> 
> Create a new check_for_livepatch_work() with the fastpath only, and make the
> "new" do_livepatch_work() noinline.  This causes the fastpath to need no stack
> frame, making it faster still.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>