[PATCH v1 0/3] Lockless SMP function call and TLB flushing

Ross Lagerwall posted 3 patches 2 hours ago
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20260401163521.3603665-1-ross.lagerwall@citrix.com
Hi,

This series makes SMP function calls lockless and then rewrites x86 TLB
flushing to use them.

We have observed that the TLB flush lock can be a point of contention for
certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.

Performance numbers:

I wrote a synthetic benchmark to measure the performance. The benchmark has 1,
2 or 4 CPUs in Xen (the "threads" in the table below) calling
on_selected_cpus() with between 1 and 64 CPUs in the selected mask. The
executed function simply delays for 500 microseconds.

The table below shows the % change in execution time of on_selected_cpus()
with this series applied (negative means faster):

                  1 thread   2 threads    4 threads
1 CPU in mask     0.02       -35.23       -51.18
2 CPUs in mask    0.01       -47.20       -69.27
4 CPUs in mask    -0.02      -42.40       -66.55
8 CPUs in mask    -0.03      -47.82       -68.39
16 CPUs in mask   0.12       -41.95       -58.26
32 CPUs in mask   0.02       -25.43       -39.35
64 CPUs in mask   0.00       -24.70       -37.83

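With 1 thread (i.e. no contention), the change stays within 0.12% in either
direction, so there is no regression in execution time. With multiple threads,
as expected, there is a significant reduction in execution time: roughly 25-48%
with 2 threads and 38-69% with 4 threads.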

As a more practical benchmark to simulate host evacuation, I measured the
memory dirtying rate across 10 VMs after enabling log dirty (on an AMD system,
so without PML). The rate increased by 16% with this patch series, even
after the recent deferred TLB flush changes.

FWIW, my first attempt at this was to port the SMP call functionality from
Linux. I found it didn't scale well as the number of CPUs in the mask
increased, so I've taken a different approach here.

Thanks,
Ross

Ross Lagerwall (3):
  x86/hap: Wait for remote CPUs during TLB flush
  xen/smp: Rewrite on_selected_cpus() to be lockless
  x86/smp: Rewrite TLB flush using on_selected_cpus()

 tools/xentrace/xenalyze.c              |   2 -
 xen/arch/x86/include/asm/irq-vectors.h |   1 -
 xen/arch/x86/include/asm/irq.h         |   1 -
 xen/arch/x86/mm/hap/hap.c              |   2 +-
 xen/arch/x86/smp.c                     |  30 ++++----
 xen/arch/x86/smpboot.c                 |   1 -
 xen/common/smp.c                       | 101 ++++++++++++++++---------
 7 files changed, 80 insertions(+), 58 deletions(-)

-- 
2.53.0