[PATCH v2] ppc/spapr: Skip system reset for quiesced CPUs

Posted by Shivang Upadhyay 2 weeks, 5 days ago

During DLPAR CPU hotplug, newly added CPUs start in RTAS stopped state
(quiesced). If a kexec crash occurs before the guest starts these CPUs
via start-cpu RTAS call, H_SIGNAL_SYS_RESET_ALL_OTHERS will reset them
anyway, causing the kdump kernel to hang:

  [    5.519483][    T1] Processor 0 is stuck.
  [   11.089481][    T1] Processor 1 is stuck.

The hypervisor should only reset CPUs that the guest has started. The
cpu->env.quiesced flag tracks RTAS stopped state - CPUs in this state
are already inactive and should not be reset.

Skip system reset for quiesced CPUs to prevent kdump hangs during CPU
hotplug operations.

Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Harsh Prateek Bora <harshpb@linux.ibm.com>
Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
Reported-by: Anushree Mathur <anushree.mathur@linux.vnet.ibm.com>
Suggested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Shivang Upadhyay <shivangu@linux.ibm.com>
---
Changelog:

v2:
 * added braces to adhere to style guide.
 * rebase to master

v1:
 * https://lore.kernel.org/all/20260430085409.680930-1-shivangu@linux.ibm.com/
---
 hw/ppc/spapr_hcall.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 032805a8d0..613dd893bb 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1105,6 +1105,12 @@ static target_ulong h_signal_sys_reset(PowerPCCPU *cpu,
                     continue;
                 }
             }
+
+            /* Skip quiesced CPUs */
+            if (c->env.quiesced) {
+                continue;
+            }
+
             run_on_cpu(cs, spapr_do_system_reset_on_cpu, RUN_ON_CPU_NULL);
         }
         return H_SUCCESS;
-- 
2.53.0
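
The effect of the added check can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ModelCpu` and `model_reset_all_others()` are hypothetical names invented for illustration and are not part of QEMU; the real loop iterates over QEMU's CPU list and dispatches the reset with run_on_cpu(), as the hunk above shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; this is not QEMU's PowerPCCPU. */
typedef struct {
    bool quiesced;   /* true while the CPU is in the RTAS stopped state */
    int reset_count; /* number of system resets delivered to this CPU */
} ModelCpu;

/*
 * Model of an "all others" system reset broadcast: deliver a reset to
 * every CPU except the caller and except CPUs that the guest has not
 * started yet. Returns the number of CPUs that were reset.
 */
static size_t model_reset_all_others(ModelCpu *cpus, size_t ncpus,
                                     size_t caller)
{
    size_t delivered = 0;

    assert(caller < ncpus);

    for (size_t i = 0; i < ncpus; i++) {
        if (i == caller) {
            continue; /* the calling CPU never resets itself */
        }
        if (cpus[i].quiesced) {
            continue; /* mirrors the check added by the patch */
        }
        cpus[i].reset_count++;
        delivered++;
    }
    return delivered;
}
```

With four modelled CPUs of which the last two are still quiesced (hotplugged but not yet started), a broadcast from CPU 0 would reach only CPU 1 in this model.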

Re: [PATCH v2] ppc/spapr: Skip system reset for quiesced CPUs

Posted by Anushree Mathur 2 weeks, 2 days ago


On 11/05/26 3:20 PM, Shivang Upadhyay wrote:
> During DLPAR CPU hotplug, newly added CPUs start in RTAS stopped state
> (quiesced). If a kexec crash occurs before the guest starts these CPUs
> via start-cpu RTAS call, H_SIGNAL_SYS_RESET_ALL_OTHERS will reset them
> anyway, causing the kdump kernel to hang:
>
>    [    5.519483][    T1] Processor 0 is stuck.
>    [   11.089481][    T1] Processor 1 is stuck.
>
> The hypervisor should only reset CPUs that the guest has started. The
> cpu->env.quiesced flag tracks RTAS stopped state - CPUs in this state
> are already inactive and should not be reset.
>
> Skip system reset for quiesced CPUs to prevent kdump hangs during CPU
> hotplug operations.
>
> Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
> Cc: Harsh Prateek Bora <harshpb@linux.ibm.com>
> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
> Reported-by: Anushree Mathur <anushree.mathur@linux.vnet.ibm.com>
> Suggested-by: Vishal Chourasia <vishalc@linux.ibm.com>
> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
> Signed-off-by: Shivang Upadhyay <shivangu@linux.ibm.com>
> ---
> Changelog:
>
> v2:
>   * added braces to adhere to style guide.
>   * rebase to master
>
> v1:
>   * https://lore.kernel.org/all/20260430085409.680930-1-shivangu@linux.ibm.com/
> ---
>   hw/ppc/spapr_hcall.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 032805a8d0..613dd893bb 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1105,6 +1105,12 @@ static target_ulong h_signal_sys_reset(PowerPCCPU *cpu,
>                       continue;
>                   }
>               }
> +
> +            /* Skip quiesced CPUs */
> +            if (c->env.quiesced) {
> +                continue;
> +            }
> +
>               run_on_cpu(cs, spapr_do_system_reset_on_cpu, RUN_ON_CPU_NULL);
>           }
>           return H_SUCCESS;


Hi Shivang,
Thanks for working on the reported issue. After applying the patch, the
reported issue (the guest hanging when kdump is triggered while CPU
hotplug is in progress) is fixed. However, over multiple attempts of the
same scenario I am seeing several other issues, the most serious of which
is a QEMU crash. I believe this needs to be fixed.


Here is my analysis of this issue with and without the patch:

1) Without applying the patch:

i) Start the guest with maxvcpus set to 64 and current vcpus set to 8.
ii) Start CPU hotplug [virsh setvcpus guest_name 64] and, at the same
time, trigger kdump on the guest [echo c > /proc/sysrq-trigger].
The guest hangs.


[   32.930453][ T1208] NIP [00007fffbe35b3c4] 0x7fffbe35b3c4
[   32.930528][ T1208] LR [00007fffbe35b3c4] 0x7fffbe35b3c4
[   32.930638][ T1208] ---- interrupt: 3000
[    9.857410][    T1] Processor 0 is stuck.



2) After applying the patch


Several issues were seen over multiple attempts of this scenario:


i) In the 4th attempt I saw DLPAR-related traces along with an Oops after
triggering kdump:

[    6.071156][  T121] pseries-hotplug-cpu: Cannot add cpu /cpus/PowerPC,POWER11@20; this system configuration supports 32 logical cpus.
[    6.071313][  T121] OF: changeset notifier error @/cpus/PowerPC,POWER11@20
[    6.074099][  T121] BUG: Unable to handle kernel data access at 0x151591241bba0bb6
[    6.074232][  T121] Faulting instruction address: 0xc0000000211e5b98
[    6.074311][  T121] Oops: Kernel access of bad area, sig: 11 [#1]

[    6.076695][  T121] Call Trace:
[    6.076741][  T121] [c000000026a6baf0] [c000000026a6bb30] 0xc000000026a6bb30 (unreliable)
[    6.076834][  T121] [c000000026a6bb60] [0000000010000021] 0x10000021
[    6.076930][  T121] [c000000026a6bb90] [c000000020e55b14] of_get_next_child+0x64/0xd0
[    6.077034][  T121] [c000000026a6bbd0] [c0000000201cd1dc] dlpar_cpu_add+0xbc/0x5e0
[    6.077148][  T121] [c000000026a6bcb0] [c0000000201ce9d0] dlpar_cpu+0x60/0x1f0
[    6.077241][  T121] [c000000026a6bd40] [c0000000201c5914] handle_dlpar_errorlog+0x1f4/0x6e0
[    6.077333][  T121] [c000000026a6be20] [c0000000201c5e28] pseries_hp_work_fn+0x28/0x60
[    6.077425][  T121] [c000000026a6be50] [c000000020259e6c] process_one_work+0x1dc/0x540
[    6.077516][  T121] [c000000026a6bf00] [c00000002025ae0c] worker_thread+0x36c/0x4d0
[    6.077608][  T121] [c000000026a6bf90] [c000000020269978] kthread+0x168/0x190
[    6.077700][  T121] [c000000026a6bfe0] [c00000002000de58] start_kernel_thread+0x14/0x18

ii) In the 7th attempt I saw XIVE errors:


[   61.692603][ T1909] ---- interrupt: 3000
[    0.010215][    T1] xive: H_INT_GET_QUEUE_INFO cpu=62 prio=6 failed -55
[    0.013834][    T1] xive: Error -55 getting queue info CPU 62 prio 6



iii) QEMU crashed after 10 attempts with the following error message in
the libvirt/qemu logs:

qemu-system-ppc64: ../hw/ppc/spapr.c:4396: spapr_cpu_index_to_props: Assertion `core_slot' failed.
2026-05-13 09:46:29.656+0000: shutting down, reason=crashed


Thank you!
Anushree Mathur

Re: [PATCH v2] ppc/spapr: Skip system reset for quiesced CPUs

Posted by Harsh Prateek Bora 1 week ago


On 13/05/26 11:01 pm, Anushree Mathur wrote:
> 
> 
> On 11/05/26 3:20 PM, Shivang Upadhyay wrote:
>> During DLPAR CPU hotplug, newly added CPUs start in RTAS stopped state
>> (quiesced). If a kexec crash occurs before the guest starts these CPUs
>> via start-cpu RTAS call, H_SIGNAL_SYS_RESET_ALL_OTHERS will reset them
>> anyway, causing the kdump kernel to hang:
>>
>>    [    5.519483][    T1] Processor 0 is stuck.
>>    [   11.089481][    T1] Processor 1 is stuck.
>>
>> The hypervisor should only reset CPUs that the guest has started. The
>> cpu->env.quiesced flag tracks RTAS stopped state - CPUs in this state
>> are already inactive and should not be reset.
>>
>> Skip system reset for quiesced CPUs to prevent kdump hangs during CPU
>> hotplug operations.
>>
>> Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
>> Cc: Harsh Prateek Bora <harshpb@linux.ibm.com>
>> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
>> Reported-by: Anushree Mathur <anushree.mathur@linux.vnet.ibm.com>
>> Suggested-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> Signed-off-by: Shivang Upadhyay <shivangu@linux.ibm.com>
>> ---
>> Changelog:
>>
>> v2:
>>   * added braces to adhere to style guide.
>>   * rebase to master
>>
>> v1:
>>   * https://lore.kernel.org/all/20260430085409.680930-1-shivangu@linux.ibm.com/
>> ---
>>   hw/ppc/spapr_hcall.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 032805a8d0..613dd893bb 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -1105,6 +1105,12 @@ static target_ulong h_signal_sys_reset(PowerPCCPU *cpu,
>>                       continue;
>>                   }
>>               }
>> +
>> +            /* Skip quiesced CPUs */
>> +            if (c->env.quiesced) {
>> +                continue;
>> +            }
>> +
>>               run_on_cpu(cs, spapr_do_system_reset_on_cpu, RUN_ON_CPU_NULL);
>>           }
>>           return H_SUCCESS;
> 
> 
> Hi Shivang,
> Thanks for working on the reported issue. After applying the patch, the
> reported issue (the guest hanging when kdump is triggered while CPU
> hotplug is in progress) is fixed. However, over multiple attempts of the
> same scenario I am seeing several other issues, the most serious of which
> is a QEMU crash. I believe this needs to be fixed.
> 

I see the crash mentioned below happened after 10 attempts of kdump,
which could be an unrelated problem that can be addressed separately.
This fix seems appropriate for the original kdump hang issue, although
a more elaborate comment would have been helpful.

I would expand the comment being added as:

/*
 * Skip quiesced CPUs - they are in RTAS stopped state and
 * should not be reset. This prevents kdump hangs when CPUs
 * are hotplugged but not yet started by the guest.
 */

No need to send another version, I shall update and queue.

Thanks
Harsh


> 
> Here is my analysis of this issue with and without the patch:
> 
> 1) Without applying the patch:
> 
> i) Start the guest with maxvcpus set to 64 and current vcpus set to 8.
> ii) Start CPU hotplug [virsh setvcpus guest_name 64] and, at the same
> time, trigger kdump on the guest [echo c > /proc/sysrq-trigger].
> The guest hangs.
> 
> 
> [   32.930453][ T1208] NIP [00007fffbe35b3c4] 0x7fffbe35b3c4
> [   32.930528][ T1208] LR [00007fffbe35b3c4] 0x7fffbe35b3c4
> [   32.930638][ T1208] ---- interrupt: 3000
> [    9.857410][    T1] Processor 0 is stuck.
> 
> 
> 
> 2) After applying the patch
> 
> 
> Several issues were seen over multiple attempts of this scenario:
> 
> 
> i) In the 4th attempt I saw DLPAR-related traces along with an Oops after
> triggering kdump:
> 
> [    6.071156][  T121] pseries-hotplug-cpu: Cannot add cpu /cpus/PowerPC,POWER11@20; this system configuration supports 32 logical cpus.
> [    6.071313][  T121] OF: changeset notifier error @/cpus/PowerPC,POWER11@20
> [    6.074099][  T121] BUG: Unable to handle kernel data access at 0x151591241bba0bb6
> [    6.074232][  T121] Faulting instruction address: 0xc0000000211e5b98
> [    6.074311][  T121] Oops: Kernel access of bad area, sig: 11 [#1]
> 
> [    6.076695][  T121] Call Trace:
> [    6.076741][  T121] [c000000026a6baf0] [c000000026a6bb30] 0xc000000026a6bb30 (unreliable)
> [    6.076834][  T121] [c000000026a6bb60] [0000000010000021] 0x10000021
> [    6.076930][  T121] [c000000026a6bb90] [c000000020e55b14] of_get_next_child+0x64/0xd0
> [    6.077034][  T121] [c000000026a6bbd0] [c0000000201cd1dc] dlpar_cpu_add+0xbc/0x5e0
> [    6.077148][  T121] [c000000026a6bcb0] [c0000000201ce9d0] dlpar_cpu+0x60/0x1f0
> [    6.077241][  T121] [c000000026a6bd40] [c0000000201c5914] handle_dlpar_errorlog+0x1f4/0x6e0
> [    6.077333][  T121] [c000000026a6be20] [c0000000201c5e28] pseries_hp_work_fn+0x28/0x60
> [    6.077425][  T121] [c000000026a6be50] [c000000020259e6c] process_one_work+0x1dc/0x540
> [    6.077516][  T121] [c000000026a6bf00] [c00000002025ae0c] worker_thread+0x36c/0x4d0
> [    6.077608][  T121] [c000000026a6bf90] [c000000020269978] kthread+0x168/0x190
> [    6.077700][  T121] [c000000026a6bfe0] [c00000002000de58] start_kernel_thread+0x14/0x18
> 
> ii) In the 7th attempt I saw XIVE errors:
> 
> 
> [   61.692603][ T1909] ---- interrupt: 3000
> [    0.010215][    T1] xive: H_INT_GET_QUEUE_INFO cpu=62 prio=6 failed -55
> [    0.013834][    T1] xive: Error -55 getting queue info CPU 62 prio 6
> 
> 
> 
> iii) QEMU crashed after 10 attempts with the following error message in
> the libvirt/qemu logs:
> 
> qemu-system-ppc64: ../hw/ppc/spapr.c:4396: spapr_cpu_index_to_props: Assertion `core_slot' failed.
> 2026-05-13 09:46:29.656+0000: shutting down, reason=crashed
> 
> 
> Thank you!
> Anushree Mathur