When livepatching is enabled, this function is used all the time. Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).
This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
the optimiser has an easier time too. Bloat-o-meter reports:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
Function old new delta
check_for_livepatch_work.cold 1201 1183 -18
check_for_livepatch_work 1021 982 -39
which isn't too shabby for no logical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
I'm still a little disappointed with the code generation. GCC still chooses
to set up the full stack frame (6 regs, +3 more slots) intermixed with the
per-cpu calculations.
In isolation, GCC can check the boolean without creating a stack frame:
<work_to_do>:
48 89 e2 mov %rsp,%rdx
48 8d 05 de e1 37 00 lea 0x37e1de(%rip),%rax # ffff82d0405b6068 <per_cpu__work_to_do>
48 81 ca ff 7f 00 00 or $0x7fff,%rdx
8b 4a c1 mov -0x3f(%rdx),%ecx
48 8d 15 45 aa 39 00 lea 0x39aa45(%rip),%rdx # ffff82d0405d28e0 <__per_cpu_offset>
48 8b 14 ca mov (%rdx,%rcx,8),%rdx
0f b6 04 02 movzbl (%rdx,%rax,1),%eax
c3 retq
but I can't find a way to convince GCC that it would be worth not setting up a
stack frame in the common case, and having a few extra mov reg/reg's later
in the uncommon case.
I haven't tried manually splitting the function into a check() and a do()
function. Views on whether that might be acceptable? At a guess, do() would
need to be a static noinline to avoid it turning back into what it currently
is.
---
xen/common/livepatch.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..b6275339f663 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1706,15 +1706,15 @@ void check_for_livepatch_work(void)
     s_time_t timeout;
     unsigned long flags;
 
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
     /* Only do any work when invoked in truly idle state. */
     if ( system_state != SYS_STATE_active ||
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )
base-commit: 49818cde637b5ec20383e46b71f93b2e7d867686
--
2.30.2
On Fri, Dec 22, 2023 at 10:01 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> When livepatching is enabled, this function is used all the time. Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
> [...]
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
On 22.12.2023 23:00, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time. Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
> [...]
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> I haven't tried manually splitting the function into a check() and a do()
> function. Views on whether that might be acceptable? At a guess, do() would
> need to be a static noinline to avoid it turning back into what it currently
> is.

Or maybe move the fast-path check into an inline function, which calls the
(renamed) out-of-line one only when the fast-path check passes? Downside
would be that the per-CPU work_to_do variable then couldn't be static anymore.

Jan
On 04/01/2024 1:24 pm, Jan Beulich wrote:
> On 22.12.2023 23:00, Andrew Cooper wrote:
>> When livepatching is enabled, this function is used all the time. Really do
>> check the fastpath first, and annotate it likely() as this is the right answer
>> 100% of the time (to many significant figures).
>> [...]
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>> I haven't tried manually splitting the function into a check() and a do()
>> function. Views on whether that might be acceptable? At a guess, do() would
>> need to be a static noinline to avoid it turning back into what it currently
>> is.
> Or maybe move the fast-path check into an inline function, which calls the
> (renamed) out-of-line one only when the fast-path check passes? Downside
> would be that the per-CPU work_to_do variable then couldn't be static anymore.

We can't do that unfortunately. It's called from assembly in
reset_stack_and_*().

But, I've had another idea. I think attribute cold can help here, as it will
move the majority of the implementation into a separate section.

~Andrew
When livepatching is enabled, this function is used all the time. Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures). This cuts out 3 pointer
dereferences in the "nothing to do path".
However, GCC still needs some help to persuade it not to set the full stack
frame (6 spilled registers, 3 slots of locals) even on the fastpath.
Create a new check_for_livepatch_work() with the fastpath only, and make the
"new" do_livepatch_work() noinline. This causes the fastpath to need no stack
frame, making it faster still.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
v1-alt:
* Manually split the functions.
Experimenting with __attribute__((cold)) was disappointing. Vs this patch, it
creates an extra check_for_livepatch_work.cold function (and section) which is
just `jmp do_livepatch_work`.
---
xen/common/livepatch.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..2c4b84382798 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1693,7 +1693,7 @@ static int livepatch_spin(atomic_t *counter, s_time_t timeout,
  * The main function which manages the work of quiescing the system and
  * patching code.
  */
-void check_for_livepatch_work(void)
+static void noinline do_livepatch_work(void)
 {
 #define ACTION(x) [LIVEPATCH_ACTION_##x] = #x
     static const char *const names[] = {
@@ -1711,10 +1711,6 @@ void check_for_livepatch_work(void)
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )
@@ -1864,6 +1860,17 @@ void check_for_livepatch_work(void)
     }
 }
 
+void check_for_livepatch_work(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
+    do_livepatch_work();
+}
+
 /*
  * Only allow dependent payload is applied on top of the correct
  * build-id.
--
2.30.2
On 11.01.2024 21:11, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time. Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures). This cuts out 3 pointer
> dereferences in the "nothing to do path".
> [...]
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>