[XEN PATCH 0/9] x86: parallelize AP bring-up during boot

Krystian Hebel posted 9 patches 5 months, 2 weeks ago
Only 8 patches received!
xen/arch/x86/acpi/cpu_idle.c          |   4 +-
xen/arch/x86/acpi/lib.c               |   2 +-
xen/arch/x86/apic.c                   |   2 +-
xen/arch/x86/boot/trampoline.S        |  20 +++
xen/arch/x86/boot/x86_64.S            |  34 ++++-
xen/arch/x86/cpu/mwait-idle.c         |   4 +-
xen/arch/x86/domain.c                 |   2 +-
xen/arch/x86/include/asm/asm_defns.h  |   2 +-
xen/arch/x86/include/asm/cpufeature.h |   2 +
xen/arch/x86/include/asm/processor.h  |   2 +
xen/arch/x86/include/asm/smp.h        |   7 +-
xen/arch/x86/mpparse.c                |   6 +-
xen/arch/x86/numa.c                   |  17 +--
xen/arch/x86/platform_hypercall.c     |   2 +-
xen/arch/x86/setup.c                  |  25 +++-
xen/arch/x86/shutdown.c               |  20 ++-
xen/arch/x86/smpboot.c                | 207 ++++++++++++++++----------
xen/arch/x86/spec_ctrl.c              |   2 +-
xen/arch/x86/sysctl.c                 |   2 +-
xen/arch/x86/traps.c                  |   4 +-
xen/arch/x86/x86_64/asm-offsets.c     |   5 +-
xen/include/xen/smp.h                 |   2 -
22 files changed, 248 insertions(+), 125 deletions(-)
[XEN PATCH 0/9] x86: parallelize AP bring-up during boot
Posted by Krystian Hebel 5 months, 2 weeks ago
Patch series available on this branch:
https://github.com/TrenchBoot/xen/tree/smp_cleanup_upstreaming

This series makes AP bring-up parallel on x86. This reduces number of
IPIs (and more importantly, delays between them) required to start all
logical processors significantly.

In order to make it possible, some state variables that were global
had to be made per-CPU. In most cases, accesses to those variables can
be performed through helper macros, some of them existed before in a
different form.

In addition to work required for parallel initialization, I've fixed
issues in error path around `machine_restart()` that were discovered
during implementation and testing.

CPU hotplug should not be affected, but I had no way of testing it.
During wakeup from S3 APs are started one by one. It should be possible
to enable parallel execution there as well, but I don't have a way of
testing it as of now.

To measure the improvement, I added output lines (identical for before
and after changes so there is no impact of printing over serial) just
above and below `if ( !pv_shim )` block. `console_timestamps=raw` was
used to get as accurate timestamp as possible, and average over 3 boots
was taken into account for each measurement. The final improvement was
calculated as (1 - after/before) * 100%, rounded to 0.01%. These are
the results:

* Dell OptiPlex 9010 with Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
  (4 cores, 4 threads): 48.83%
* MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
  20 threads, 6P+8E) `smt=on`: 36.16%
* MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
  20 threads, 6P+8E) `smt=off`: 0.25% (parking takes a lot of additional
  time)
* HP t630 Thin Client with AMD Embedded G-Series GX-420GI Radeon R7E
  (4 cores, 4 threads): 68.00%

Krystian Hebel (9):
  x86/boot: choose AP stack based on APIC ID
  x86: don't access x86_cpu_to_apicid[] directly, use
    cpu_physical_id(cpu)
  x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  x86/smp: move stack_base to cpu_data
  x86/smp: call x2apic_ap_setup() earlier
  x86/shutdown: protect against recurrent machine_restart()
  x86/smp: drop booting_cpu variable
  x86/smp: make cpu_state per-CPU
  x86/smp: start APs in parallel during boot

 xen/arch/x86/acpi/cpu_idle.c          |   4 +-
 xen/arch/x86/acpi/lib.c               |   2 +-
 xen/arch/x86/apic.c                   |   2 +-
 xen/arch/x86/boot/trampoline.S        |  20 +++
 xen/arch/x86/boot/x86_64.S            |  34 ++++-
 xen/arch/x86/cpu/mwait-idle.c         |   4 +-
 xen/arch/x86/domain.c                 |   2 +-
 xen/arch/x86/include/asm/asm_defns.h  |   2 +-
 xen/arch/x86/include/asm/cpufeature.h |   2 +
 xen/arch/x86/include/asm/processor.h  |   2 +
 xen/arch/x86/include/asm/smp.h        |   7 +-
 xen/arch/x86/mpparse.c                |   6 +-
 xen/arch/x86/numa.c                   |  17 +--
 xen/arch/x86/platform_hypercall.c     |   2 +-
 xen/arch/x86/setup.c                  |  25 +++-
 xen/arch/x86/shutdown.c               |  20 ++-
 xen/arch/x86/smpboot.c                | 207 ++++++++++++++++----------
 xen/arch/x86/spec_ctrl.c              |   2 +-
 xen/arch/x86/sysctl.c                 |   2 +-
 xen/arch/x86/traps.c                  |   4 +-
 xen/arch/x86/x86_64/asm-offsets.c     |   5 +-
 xen/include/xen/smp.h                 |   2 -
 22 files changed, 248 insertions(+), 125 deletions(-)


base-commit: fb41228ececea948c7953c8c16fe28fd65c6536b
prerequisite-patch-id: 142a87c707411d49e136c3fb76f1b14963ec6dc8
prerequisite-patch-id: f155cb7e2761deec26b76f1b8b587bc56a404c80
-- 
2.41.0
Re: [XEN PATCH 0/9] x86: parallelize AP bring-up during boot
Posted by Julien Grall 5 months, 2 weeks ago
Hi Krystian,

Thanks you for sending the series! Just posting some extra data point.

On 14/11/2023 17:49, Krystian Hebel wrote:
> Patch series available on this branch:
> https://github.com/TrenchBoot/xen/tree/smp_cleanup_upstreaming
> 
> This series makes AP bring-up parallel on x86. This reduces number of
> IPIs (and more importantly, delays between them) required to start all
> logical processors significantly.
> 
> In order to make it possible, some state variables that were global
> had to be made per-CPU. In most cases, accesses to those variables can
> be performed through helper macros, some of them existed before in a
> different form.
> 
> In addition to work required for parallel initialization, I've fixed
> issues in error path around `machine_restart()` that were discovered
> during implementation and testing.
> 
> CPU hotplug should not be affected, but I had no way of testing it.
> During wakeup from S3 APs are started one by one. It should be possible
> to enable parallel execution there as well, but I don't have a way of
> testing it as of now.
> 
> To measure the improvement, I added output lines (identical for before
> and after changes so there is no impact of printing over serial) just
> above and below `if ( !pv_shim )` block. `console_timestamps=raw` was
> used to get as accurate timestamp as possible, and average over 3 boots
> was taken into account for each measurement. The final improvement was
> calculated as (1 - after/before) * 100%, rounded to 0.01%. These are
> the results:
> 
> * Dell OptiPlex 9010 with Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
>    (4 cores, 4 threads): 48.83%
> * MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
>    20 threads, 6P+8E) `smt=on`: 36.16%
> * MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
>    20 threads, 6P+8E) `smt=off`: 0.25% (parking takes a lot of additional
>    time)
> * HP t630 Thin Client with AMD Embedded G-Series GX-420GI Radeon R7E
>    (4 cores, 4 threads): 68.00%

Your series reminded me some optimization we did at AWS in the SMP boot 
code. This hasn't yet been sent upstrema so I thought I would compare to 
your series to see if there are any differences.

Instead of adding support for parallel SMP boot, we decided to optimize 
some of the wait call (see diff below).

This was tested on a nested environment (KVM/QEMU) on c5 metal with 
x2apic disabled. The numbers are for bringing-up 95 CPUs:

   * Currently: 2s
   * With AWS change only: 300ms
   * With your change only: 100ms

So your approach is superior :). I see you are dropping the loop in 
smp_callin(). So the first hunk in the below diff is not necessary 
anymore. The second hunk probably still makes sense for CPU hotplug (and 
maybe S3 bring-up) even though it would only save 10ms. I will look to 
send the patch.

Diff:

diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 3a1a659082c6..86fd5500e1ea 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -172,9 +172,9 @@ static void smp_callin(void)
      Dprintk("Waiting for CALLOUT.\n");
      for ( i = 0; cpu_state != CPU_STATE_CALLOUT; i++ )
      {
-        BUG_ON(i >= 200);
+        BUG_ON(i >= 200000);
          cpu_relax();
-        mdelay(10);
+        udelay(10);
      }

      /*
@@ -430,6 +430,10 @@ static int wakeup_secondary_cpu(int phys_apicid, 
unsigned long start_eip)
       * Refer to AMD APM Vol2 15.27 "Secure Startup with SKINIT".
       */
      bool send_INIT = ap_boot_method != AP_BOOT_SKINIT;
+    bool modern_cpu = ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
+                       (boot_cpu_data.x86 == 6)) ||
+                      ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
+                       (boot_cpu_data.x86 >= 0xF));

      /*
       * Some versions of tboot might be able to handle the entire wake 
sequence
@@ -464,7 +468,15 @@ static int wakeup_secondary_cpu(int phys_apicid, 
unsigned long start_eip)
                  send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY;
              } while ( send_status && (timeout++ < 1000) );

-            mdelay(10);
+            /*
+             * The Multiprocessor Specification 1.4 (1997) example code 
suggests
+             * that there should be a 10ms delay between the BSP 
asserting INIT
+             * and de-asserting INIT, when starting a remote processor.
+             * But that slows boot and resume on modern processors, 
which include
+             * many cores and don't require that delay.
+             */
+            if ( !modern_cpu )
+                mdelay(10);

              Dprintk("Deasserting INIT.\n");

Cheers,

-- 
Julien Grall