Hi,
This series implements a lockless SMP function call mechanism and then
rewrites x86 TLB flushing to use SMP function calls.

We have observed that the TLB flush lock can be a point of contention for
certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.
Performance numbers:

I wrote a synthetic benchmark to measure the performance. The benchmark has one
or more CPUs in Xen calling on_selected_cpus() with between 1 and 64 CPUs in
the selected mask. The executed function simply delays for 500 microseconds.
The table below shows the % change in execution time of on_selected_cpus():
                 1 thread  2 threads  4 threads
 1 CPU in mask       0.02     -35.23     -51.18
 2 CPUs in mask      0.01     -47.20     -69.27
 4 CPUs in mask     -0.02     -42.40     -66.55
 8 CPUs in mask     -0.03     -47.82     -68.39
16 CPUs in mask      0.12     -41.95     -58.26
32 CPUs in mask      0.02     -25.43     -39.35
64 CPUs in mask      0.00     -24.70     -37.83
With 1 thread (i.e. no contention), there is no regression in execution time.
With multiple threads, there is, as expected, a significant improvement in
execution time.
As a more practical benchmark to simulate host evacuation, I measured the
memory dirtying rate across 10 VMs after enabling log-dirty mode (on an AMD
system, so without PML). The rate increased by 16% with this patch series,
even on top of the recent deferred TLB flush changes.
FWIW, my first attempt at this was to port the SMP call functionality from
Linux. I found that it didn't scale well as the number of CPUs in the mask
increased, so I've taken a different approach here.
Thanks,
Ross
Ross Lagerwall (3):
x86/hap: Wait for remote CPUs during TLB flush
xen/smp: Rewrite on_selected_cpus() to be lockless
x86/smp: Rewrite TLB flush using on_selected_cpus()
tools/xentrace/xenalyze.c | 2 -
xen/arch/x86/include/asm/irq-vectors.h | 1 -
xen/arch/x86/include/asm/irq.h | 1 -
xen/arch/x86/mm/hap/hap.c | 2 +-
xen/arch/x86/smp.c | 30 ++++----
xen/arch/x86/smpboot.c | 1 -
xen/common/smp.c | 101 ++++++++++++++++---------
7 files changed, 80 insertions(+), 58 deletions(-)
--
2.53.0