[PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall

Roger Pau Monne posted 4 patches 5 months, 1 week ago
There is a newer version of this series
Posted by Roger Pau Monne 5 months, 1 week ago
With the current AP bring up code, Xen can get stuck indefinitely if an AP
freezes during boot after the 'callin' step.  Introduce a 5s timeout while
waiting for APs to finish startup.

On failure of an AP to complete startup, send an NMI to trigger the
printing of a stack backtrace on the stuck AP and panic on the BSP.

This patch was done while investigating the issue caused by Intel erratum
ICX143.  It wasn't helpful in that case, but it's still an improvement
when debugging AP bring up related issues.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Use 5s timeout.
 - Print APICID.
 - Split NMI dispatch code movement to a pre-patch.
 - Reorder timeout check condition.
---
 xen/arch/x86/smpboot.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index dbc2f2f1d411..50c5674555e4 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1372,6 +1372,7 @@ int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)
 int __cpu_up(unsigned int cpu)
 {
     int apicid, ret;
+    s_time_t start;
 
     if ( (apicid = x86_cpu_to_apicid[cpu]) == BAD_APICID )
         return -ENODEV;
@@ -1390,10 +1391,17 @@ int __cpu_up(unsigned int cpu)
     time_latch_stamps();
 
     set_cpu_state(CPU_STATE_ONLINE);
+    start = NOW();
     while ( !cpu_online(cpu) )
     {
         cpu_relax();
         process_pending_softirqs();
+        if ( (NOW() - start) > SECONDS(5) )
+        {
+            /* AP is stuck, send NMI and panic. */
+            show_execution_state_nmi(cpumask_of(cpu), true);
+            panic("CPU%u/APICID%u: Stuck while starting up\n", cpu, apicid);
+        }
     }
 
     return 0;
-- 
2.49.0


Re: [PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall
Posted by Andrew Cooper 5 months, 1 week ago
On 22/05/2025 8:54 am, Roger Pau Monne wrote:
> With the current AP bring up code, Xen can get stuck indefinitely if an AP
> freezes during boot after the 'callin' step.  Introduce a 5s timeout while
> waiting for APs to finish startup.
>
> On failure of an AP to complete startup, send an NMI to trigger the
> printing of a stack backtrace on the stuck AP and panic on the BSP.
>
> This patch was done while investigating the issue caused by Intel erratum
> ICX143.  It wasn't helpful in that case, but it's still an improvement
> when debugging AP bring up related issues.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
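
For readers following the thread outside the Xen tree, the wait loop the patch adds to __cpu_up() can be illustrated with a small userspace sketch. This is not the Xen code: now_ns(), MILLISECS_NS(), wait_for() and the two toy conditions are hypothetical names introduced here, with clock_gettime() standing in for NOW(). As in v2 of the patch, the completion condition is tested first and the timeout is only checked while the condition is still false. Where the patch sends an NMI and panics, the sketch simply returns false.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() under strict ISO modes */

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for Xen's NOW(): monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical helper, loosely modelled on SECONDS(): milliseconds to ns. */
#define MILLISECS_NS(ms) ((int64_t)(ms) * 1000000LL)

/*
 * Poll ready(arg) until it returns true or timeout_ns has elapsed.
 * Returns true if the condition was met, false on timeout.  The caller
 * decides how to react to a timeout (the patch sends an NMI and panics).
 */
static bool wait_for(bool (*ready)(void *), void *arg, int64_t timeout_ns)
{
    int64_t start = now_ns();

    assert(ready != NULL && timeout_ns >= 0);

    while ( !ready(arg) )
    {
        /* Condition still false: only now look at the elapsed time. */
        if ( (now_ns() - start) > timeout_ns )
            return false;
    }

    return true;
}

/* Toy condition: reports ready once it has been polled polls_left times. */
struct countdown {
    unsigned int polls_left;
};

static bool countdown_ready(void *arg)
{
    struct countdown *c = arg;

    if ( c->polls_left == 0 )
        return true;

    c->polls_left--;
    return false;
}

/* Toy condition modelling a stuck AP: never reports ready. */
static bool never_ready(void *arg)
{
    (void)arg;
    return false;
}
```

Calling wait_for(never_ready, NULL, MILLISECS_NS(20)) returns false after roughly 20ms, while a countdown_ready condition that completes within the timeout makes wait_for() return true. Testing the condition before the clock means a CPU that comes online right at the deadline is still treated as a success.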