[v3] xen: fix HVM kexec kernel panic

[PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Dongli Zhang 3 years, 11 months ago

This is the v3 of the patch to fix xen kexec kernel panic issue when the
kexec is triggered on VCPU >= 32.

PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
[    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
[    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
[    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
[    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
[    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
[    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
[    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
[    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.000000] Call Trace:
[    0.000000]  <TASK>
[    0.000000]  ? xen_clocksource_read+0x24/0x40
[    0.000000]  ? xen_init_time_common+0x5/0x49
[    0.000000]  ? xen_hvm_init_time_ops+0x23/0x45
[    0.000000]  ? xen_hvm_guest_init+0x221/0x25c
[    0.000000]  ? 0xffffffffa9600000
[    0.000000]  ? setup_arch+0x440/0xbd6
[    0.000000]  ? start_kernel+0x6c/0x695
[    0.000000]  ? secondary_startup_64_no_verify+0xd5/0xdb
[    0.000000]  </TASK>


Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
    for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
    code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message that why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)


I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
now I am able to reproduce the clock backward as shown below on some old
versions of xen. I am not able to reproduce on most recent mainline xen.

[    0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
[    0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
[    0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63

... clock backward after the clocksource is switched from native to xen...

[    0.000004] Fallback order for Node 0: 0
[    0.002967] Built 1 zonelists, mobility grouping on.  Total pages: 3527744
[    0.007129] Policy zone: Normal
[    0.008937] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 earlyprintk=tty    S0,115200n8 loglevel=7 no_timer_check reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
[    0.023880] Unknown kernel command line parameters "text splash BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be passed to user space.
[    0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
[    0.041049] printk: log_buf_len min size: 262144 bytes
[    0.044481] printk: log_buf_len: 524288 bytes


Since now I am able to reproduce the clock backward on old xen version,
please let me know if I should re-use the v2 of this patch, as it has been
running well in our env well for very long time.

https://lore.kernel.org/all/20211028012543.8776-1-dongli.zhang@oracle.com/


BTW, I have tested that 'soft_reset' does not work with mainline xen, even
when I directly trigger kexec with below commands.

# kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
        --initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
        --reuse-cmdline
# kexec -e


Thank you very much!

Dongli Zhang

Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Boris Ostrovsky 3 years, 11 months ago

On 2/24/22 4:50 PM, Dongli Zhang wrote:
> This is the v3 of the patch to fix xen kexec kernel panic issue when the
> kexec is triggered on VCPU >= 32.
>
> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
> ... ...
> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.000000] Call Trace:
> [    0.000000]  <TASK>
> [    0.000000]  ? xen_clocksource_read+0x24/0x40


This is done to set xen_sched_clock_offset which I think will not be used for a while, until sched_clock is called (and the other two uses are for suspend/resume)


Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until after all vcpu areas are properly set? Or are there other uses of xen_clocksource_read() before ?


-boris


> [    0.000000]  ? xen_init_time_common+0x5/0x49
> [    0.000000]  ? xen_hvm_init_time_ops+0x23/0x45
> [    0.000000]  ? xen_hvm_guest_init+0x221/0x25c
> [    0.000000]  ? 0xffffffffa9600000
> [    0.000000]  ? setup_arch+0x440/0xbd6
> [    0.000000]  ? start_kernel+0x6c/0x695
> [    0.000000]  ? secondary_startup_64_no_verify+0xd5/0xdb
> [    0.000000]  </TASK>
>
>
> Changed since v1:
>    - Add commit message to explain why xen_hvm_init_time_ops() is delayed
>      for any vcpus. (Suggested by Boris Ostrovsky)
>    - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
>      code in xen_hvm_guest_init(). (suggested by Juergen Gross)
> Changed since v2:
>    - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
>    - Add commit message that why PVM is not supported by this patch
>    - Test if kexec/kdump works with mainline xen (HVM and PVM)
>
>
> I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
> now I am able to reproduce the clock backward as shown below on some old
> versions of xen. I am not able to reproduce on most recent mainline xen.
>
> [    0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
> [    0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
> [    0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63
>
> ... clock backward after the clocksource is switched from native to xen...
>
> [    0.000004] Fallback order for Node 0: 0
> [    0.002967] Built 1 zonelists, mobility grouping on.  Total pages: 3527744
> [    0.007129] Policy zone: Normal
> [    0.008937] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 earlyprintk=tty    S0,115200n8 loglevel=7 no_timer_check reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
> [    0.023880] Unknown kernel command line parameters "text splash BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be passed to user space.
> [    0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [    0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
> [    0.041049] printk: log_buf_len min size: 262144 bytes
> [    0.044481] printk: log_buf_len: 524288 bytes
>
>
> Since now I am able to reproduce the clock backward on old xen version,
> please let me know if I should re-use the v2 of this patch, as it has been
> running well in our env well for very long time.
>
> https://lore.kernel.org/all/20211028012543.8776-1-dongli.zhang@oracle.com/
>
>
> BTW, I have tested that 'soft_reset' does not work with mainline xen, even
> when I directly trigger kexec with below commands.
>
> # kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
>          --initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
>          --reuse-cmdline
> # kexec -e
>
>
> Thank you very much!
>
> Dongli Zhang
>
>

Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Dongli Zhang 3 years, 11 months ago

Hi Boris,

On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
> 
> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>> kexec is triggered on VCPU >= 32.
>>
>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>> ... ...
>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>> 0000000000000000
>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>> knlGS:0000000000000000
>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [    0.000000] Call Trace:
>> [    0.000000]  <TASK>
>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
> 
> 
> This is done to set xen_sched_clock_offset which I think will not be used for a
> while, until sched_clock is called (and the other two uses are for suspend/resume)
> 
> 
> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
> after all vcpu areas are properly set? Or are there other uses of
> xen_clocksource_read() before ?
> 

I have tested that below patch will panic kdump kernel.

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..6a0c99941ae1 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,8 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
         */
        xen_vcpu_setup(0);

+       xen_init_sched_clock_offset();
+
        /*
         * The alternative logic (which patches the unlock/lock) runs before
         * the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d9c945ee1100..8a2eafa0c215 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -520,9 +520,14 @@ static void __init xen_time_init(void)
                pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }

-static void __init xen_init_time_common(void)
+void xen_init_sched_clock_offset(void)
 {
        xen_sched_clock_offset = xen_clocksource_read();
+}
+
+static void __init xen_init_time_common(void)
+{
+       xen_sched_clock_offset = 0;
        static_call_update(pv_steal_clock, xen_steal_clock);
        paravirt_set_sched_clock(xen_sched_clock);

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fd0fec6e92f4..9f7656214dfb 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -69,6 +69,7 @@ void xen_teardown_timer(int cpu);
 void xen_setup_cpu_clockevents(void);
 void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
+void xen_init_sched_clock_offset(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);



Unfortunately, I am not able to obtain the panic callstack from kdump time this
time. I have only below.

PANIC: early exception 0x0e IP 10:ffffffffa6c679b6 error 0 cr2 0x20
PANIC: early exception 0x0e IP 10:ffffffffa6c679b6 error 0 cr2 0x20



The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). Any printk should use sched_clock()
to obtain the timestamp.

vprintk_store()
-> local_clock()
   -> sched_clock()
      -> paravirt_sched_clock()
         -> xen_sched_clock()
            -> xen_clocksource_read()


AFAIR, we started to encounter the issue since commit 857baa87b642
("sched/clock: Enable sched clock early"). kdump used to work well before that
commit.

Thank you very much!

Dongli Zhang

Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Boris Ostrovsky 3 years, 11 months ago


On 2/25/22 8:17 PM, Dongli Zhang wrote:
> Hi Boris,
> 
> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>
>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>> kexec is triggered on VCPU >= 32.
>>>
>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>> ... ...
>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>> 0000000000000000
>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>> knlGS:0000000000000000
>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    0.000000] Call Trace:
>>> [    0.000000]  <TASK>
>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>
>>
>> This is done to set xen_sched_clock_offset which I think will not be used for a
>> while, until sched_clock is called (and the other two uses are for suspend/resume)
>>
>>
>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>> after all vcpu areas are properly set? Or are there other uses of
>> xen_clocksource_read() before ?
>>
> 
> I have tested that below patch will panic kdump kernel.
> 



Oh well, so much for that then. Yes, sched_clock() is at least called from printk path.


I guess we will have to go with v2 then, we don't want to start seeing time going back, even if only with older hypervisors. The only thing I might ask is that you roll the logic inside xen_hvm_init_time_ops(). Something like


xen_hvm_init_time_ops()
{
	/*
	 * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
	 * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
	 */
	if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
		return;

	...

-boris

Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Dongli Zhang 3 years, 11 months ago

Hi Boris,

On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
> 
> 
> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>> kexec is triggered on VCPU >= 32.
>>>>
>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>> ... ...
>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>> 0000000000000000
>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>> 0000000000000002
>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>> 0000000000000020
>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>> 0000000040000004
>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>> 0000000000000000
>>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>>> knlGS:0000000000000000
>>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>> 00000000000406a0
>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>> [    0.000000] Call Trace:
>>>> [    0.000000]  <TASK>
>>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>>
>>>
>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>> while, until sched_clock is called (and the other two uses are for
>>> suspend/resume)
>>>
>>>
>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>> after all vcpu areas are properly set? Or are there other uses of
>>> xen_clocksource_read() before ?
>>>
>>
>> I have tested that below patch will panic kdump kernel.
>>
> 
> 
> 
> Oh well, so much for that then. Yes, sched_clock() is at least called from
> printk path.
> 
> 
> I guess we will have to go with v2 then, we don't want to start seeing time
> going back, even if only with older hypervisors. The only thing I might ask is
> that you roll the logic inside xen_hvm_init_time_ops(). Something like
> 
> 
> xen_hvm_init_time_ops()
> {
>     /*
>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>      */
>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>         return;
> 

I think you meant __this_cpu_read(xen_vcpu).

I will call xen_hvm_init_time_ops() at both places, and move the logic into
xen_hvm_init_time_ops().

Thank you very much!

Dongli Zhang

Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Dongli Zhang 3 years, 11 months ago

Hi Boris,

On 2/28/22 5:18 PM, Dongli Zhang wrote:
> Hi Boris,
> 
> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>
>>
>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>> Hi Boris,
>>>
>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>
>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>> kexec is triggered on VCPU >= 32.
>>>>>
>>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>> ... ...
>>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>> 0000000000000000
>>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>>> 0000000000000002
>>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>>> 0000000000000020
>>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>>> 0000000000000001
>>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>>> 0000000040000004
>>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>>> 0000000000000000
>>>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>>>> knlGS:0000000000000000
>>>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>>> 00000000000406a0
>>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>> 0000000000000000
>>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>> 0000000000000400
>>>>> [    0.000000] Call Trace:
>>>>> [    0.000000]  <TASK>
>>>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>>>
>>>>
>>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>>> while, until sched_clock is called (and the other two uses are for
>>>> suspend/resume)
>>>>
>>>>
>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>>> after all vcpu areas are properly set? Or are there other uses of
>>>> xen_clocksource_read() before ?
>>>>
>>>
>>> I have tested that below patch will panic kdump kernel.
>>>
>>
>>
>>
>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>> printk path.
>>
>>
>> I guess we will have to go with v2 then, we don't want to start seeing time
>> going back, even if only with older hypervisors. The only thing I might ask is
>> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>
>>
>> xen_hvm_init_time_ops()
>> {
>>     /*
>>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>      */
>>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>         return;
>>
> 
> I think you meant __this_cpu_read(xen_vcpu).
> 
> I will call xen_hvm_init_time_ops() at both places, and move the logic into
> xen_hvm_init_time_ops().
> 
> Thank you very much!
> 
> Dongli Zhang
> 


How about we do not move the logic into xen_hvm_init_time_ops()?

Suppose we move the logic into xen_hvm_init_time_ops() line 573, the line line
570 might be printed twice.


559 void __init xen_hvm_init_time_ops(void)
560 {
561         /*
562          * vector callback is needed otherwise we cannot receive interrupts
563          * on cpu > 0 and at this point we don't know how many cpus are
564          * available.
565          */
566         if (!xen_have_vector_callback)
567                 return;
568
569         if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
570                 pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
571                 return;
572         }
573
574         xen_init_time_common();
575
576         x86_init.timers.setup_percpu_clockev = xen_time_init;
577         x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
578
579         x86_platform.set_wallclock = xen_set_wallclock;
580 }

I feel the code looks better if we keep the logic at caller side. Would you mind
letting me know your feedback?

Thank you very much!

Dongli Zhang

Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Posted by Boris Ostrovsky 3 years, 11 months ago

On 2/28/22 11:56 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 2/28/22 5:18 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>>> Hi Boris,
>>>>
>>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>>> kexec is triggered on VCPU >= 32.
>>>>>>
>>>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>>> ... ...
>>>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>>> 0000000000000000
>>>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>>>> 0000000000000002
>>>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>>>> 0000000000000020
>>>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>>>> 0000000000000001
>>>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>>>> 0000000040000004
>>>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>>>> 0000000000000000
>>>>>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffffab588000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>>>> 00000000000406a0
>>>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>> 0000000000000000
>>>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>> 0000000000000400
>>>>>> [    0.000000] Call Trace:
>>>>>> [    0.000000]  <TASK>
>>>>>> [    0.000000]  ? xen_clocksource_read+0x24/0x40
>>>>>
>>>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>>>> while, until sched_clock is called (and the other two uses are for
>>>>> suspend/resume)
>>>>>
>>>>>
>>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>>>> after all vcpu areas are properly set? Or are there other uses of
>>>>> xen_clocksource_read() before ?
>>>>>
>>>> I have tested that below patch will panic kdump kernel.
>>>>
>>>
>>>
>>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>>> printk path.
>>>
>>>
>>> I guess we will have to go with v2 then, we don't want to start seeing time
>>> going back, even if only with older hypervisors. The only thing I might ask is
>>> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>>
>>>
>>> xen_hvm_init_time_ops()
>>> {
>>>      /*
>>>       * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>>       * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>>       */
>>>      if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>>          return;
>>>
>> I think you meant __this_cpu_read(xen_vcpu).
>>
>> I will call xen_hvm_init_time_ops() at both places, and move the logic into
>> xen_hvm_init_time_ops().
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>
> How about we do not move the logic into xen_hvm_init_time_ops()?
>
> Suppose we move the logic into xen_hvm_init_time_ops() line 573, the line line
> 570 might be printed twice.


You would need to make sure the routine is executed only once so something like a local static variable would be needed.


>
>
> 559 void __init xen_hvm_init_time_ops(void)
> 560 {
> 561         /*
> 562          * vector callback is needed otherwise we cannot receive interrupts
> 563          * on cpu > 0 and at this point we don't know how many cpus are
> 564          * available.
> 565          */
> 566         if (!xen_have_vector_callback)
> 567                 return;
> 568
> 569         if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> 570                 pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> 571                 return;
> 572         }
> 573
> 574         xen_init_time_common();
> 575
> 576         x86_init.timers.setup_percpu_clockev = xen_time_init;
> 577         x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
> 578
> 579         x86_platform.set_wallclock = xen_set_wallclock;
> 580 }
>
> I feel the code looks better if we keep the logic at caller side. Would you mind
> letting me know your feedback?


My preference is to keep logic concentrated in one place whenever possible.


-boris