[v2] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving

[RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG

Posted by David Woodhouse 1 year, 2 months ago

From: David Woodhouse <dwmw@amazon.co.uk>

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/relocate_kernel_64.S | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 67f6853c7abe..ebbd76c9a3e9 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -14,6 +14,8 @@
 #include <asm/nospec-branch.h>
 #include <asm/unwind_hints.h>
 
+#define DEBUG
+
 /*
  * Must be relocatable PIC code callable as a C function, in particular
  * there must be a plain RET and not jump to return thunk.
@@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	pushw	$0xff
 	lidt	(%rsp)
 	addq	$10, %rsp
+
+	int3
 #endif /* DEBUG */
 
 	/*
-- 
2.47.0

Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG

Posted by Ingo Molnar 1 year, 2 months ago

* David Woodhouse <dwmw2@infradead.org> wrote:

> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kernel/relocate_kernel_64.S | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 67f6853c7abe..ebbd76c9a3e9 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -14,6 +14,8 @@
>  #include <asm/nospec-branch.h>
>  #include <asm/unwind_hints.h>
>  
> +#define DEBUG
> +
>  /*
>   * Must be relocatable PIC code callable as a C function, in particular
>   * there must be a plain RET and not jump to return thunk.
> @@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	pushw	$0xff
>  	lidt	(%rsp)
>  	addq	$10, %rsp
> +
> +	int3
>  #endif /* DEBUG */

That's a really nice piece of debugging code written in assembly, 
combined with the exception handling feature that generates debug 
output to begin with. Epic effort. :-)

Just curious: did you write this code to debug the series, or was there 
some original hair-tearing regression that motivated you? Is there an 
upstream fix to marvel at and be horrified about in equal measure?

I'd argue that this debugging code probably needs a default-off Kconfig 
option, even with the obvious hard-coded environmental limitations & 
assumptions it has. Could be useful for very early debugging & would 
preserve your effort without it bitrotting too obviously.

Thanks,

	Ingo

Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG

Posted by David Woodhouse 1 year, 2 months ago

On Mon, 2024-11-25 at 10:21 +0100, Ingo Molnar wrote:
> 
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  arch/x86/kernel/relocate_kernel_64.S | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 67f6853c7abe..ebbd76c9a3e9 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -14,6 +14,8 @@
> >  #include <asm/nospec-branch.h>
> >  #include <asm/unwind_hints.h>
> >  
> > +#define DEBUG
> > +
> >  /*
> >   * Must be relocatable PIC code callable as a C function, in particular
> >   * there must be a plain RET and not jump to return thunk.
> > @@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	pushw	$0xff
> >  	lidt	(%rsp)
> >  	addq	$10, %rsp
> > +
> > +	int3
> >  #endif /* DEBUG */
> 
> That's a really nice piece of debugging code written in assembly, 
> combined with the exception handling feature that generates debug 
> output to begin with. Epic effort. :-)

Thanks :)

> Just curious: did you write this code to debug the series, or was there 
> some original hair-tearing regression that motivated you? Is there an 
> upstream fix to marvel at and be horrified about in equal measure?

https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/t/#u
is the upstream fix. It's all the more horrifying because it was
already *fixed* upstream before I lost weeks of my life to chasing it.
And the trigger which actually made it *happen*, and made our
production systems allocate memory within that dangerous 1MiB region
adjacent to the RMP table, was a tweak to the NMI watchdog period...
leading to an assumption that we were getting stray perf NMIs during
the kexec, and a *long* wild goose chase based on that false
assumption...

Once I'd written the debug code, I just wanted to clean it up a bit and
push it out for the benefit of others; that *was* the main point of
this series. All the rest of the cleanups are just yak shaving.

The realisation that we never even explicitly mapped the control code
page and always just got lucky because it happened to be in the same
2MiB or 1GiB superpage as something else that we did map... was just a
bonus :)

(That one is fixed in v3 which I'll post shortly, and is already in 
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kexec-debug
)
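
The "same superpage" luck described above comes down to address arithmetic, which can be sketched as follows. This is only an illustration: same_superpage() and the SZ_2M / SZ_1G constants are hypothetical names for this sketch, not code from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (UINT64_C(1) << 21)	/* 2MiB superpage size */
#define SZ_1G (UINT64_C(1) << 30)	/* 1GiB superpage size */

/*
 * True if both addresses fall inside the same naturally aligned block of
 * 'size' bytes. 'size' must be a power of two.
 */
static bool same_superpage(uint64_t a, uint64_t b, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (a & ~(size - 1)) == (b & ~(size - 1));
}
```

If something at 0x200000 is mapped with a 2MiB superpage, a control page at 0x3ff000 is covered as a side effect, while one at 0x1ff000 is not. With a 1GiB superpage both would be covered, which is why the missing explicit mapping could go unnoticed.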

> I'd argue that this debugging code probably needs a default-off Kconfig 
> option, even with the obvious hard-coded environmental limitations & 
> assumptions it has. Could be useful for very early debugging & would 
> preserve your effort without it bitrotting too obviously.

Yeah. In v3 I've made it a config option, and made it use the
early_printk serial console (as long as that's an I/O based 8250; we
can add others too later).
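
For readers unfamiliar with polled output on an I/O based 8250, a minimal sketch follows. It is not the code from v3: the port_ops indirection and the fake_* device are assumptions added so that the example runs in user space, where real port I/O is not available. The register offsets and the THRE bit are the standard 8250 ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UART_TX		0	/* transmit holding register offset */
#define UART_LSR	5	/* line status register offset */
#define UART_LSR_THRE	0x20	/* transmit holding register empty */

/* Hypothetical port accessors; real code would use inb/outb directly. */
struct port_ops {
	uint8_t (*in8)(uint16_t port);
	void (*out8)(uint16_t port, uint8_t val);
};

/* Busy-wait until the transmitter can take a byte, then write it. */
static void uart_putc(const struct port_ops *ops, uint16_t base, char c)
{
	assert(ops && ops->in8 && ops->out8);
	while (!(ops->in8((uint16_t)(base + UART_LSR)) & UART_LSR_THRE))
		;
	ops->out8((uint16_t)(base + UART_TX), (uint8_t)c);
}

static void uart_puts(const struct port_ops *ops, uint16_t base, const char *s)
{
	while (*s)
		uart_putc(ops, base, *s++);
}

/* Fake UART for the demo: reports "busy" twice, then logs written bytes. */
static char fake_log[64];
static size_t fake_len;
static unsigned int fake_busy_polls = 2;

static uint8_t fake_in8(uint16_t port)
{
	if ((port & 7) != UART_LSR)
		return 0;
	if (fake_busy_polls) {
		fake_busy_polls--;
		return 0;		/* transmitter still busy */
	}
	return UART_LSR_THRE;
}

static void fake_out8(uint16_t port, uint8_t val)
{
	if ((port & 7) == UART_TX && fake_len < sizeof(fake_log) - 1)
		fake_log[fake_len++] = (char)val;
}

static const struct port_ops fake_ops = { fake_in8, fake_out8 };
```

Calling uart_puts(&fake_ops, 0x3f8, "ok") polls the fake line status register until it reports THRE and then stores the two bytes in fake_log. The base 0x3f8 is the conventional COM1 address and is used here only as an example.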

Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG

Posted by Ingo Molnar 1 year, 2 months ago

* David Woodhouse <dwmw2@infradead.org> wrote:

> > Just curious: did you write this code to debug the series, or was 
> > there some original hair-tearing regression that motivated you? Is 
> > there an upstream fix to marvel at and be horrified about in 
> > equal measure?
> 
> https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/t/#u
> is the upstream fix.

Which ended up being the following upstream commit:

  88a921aa3c6b ("x86/sev: Ensure that RMP table fixups are reserved")

Might make sense to add this commit reference to one of the central 
patches of the GDT/IDT code, to document how this feature is able to 
pin down very hard to debug regressions. (Even if the upstream fix was 
done independently in probably luckier circumstances.)

> [...] It's all the more horrifying because it was already *fixed* 
> upstream before I lost weeks of my life to chasing it. And the 
> trigger which actually made it *happen*, and made our production 
> systems allocate memory within that dangerous 1MiB region adjacent to 
> the RMP table, was a tweak to the NMI watchdog period... leading to 
> an assumption that we were getting stray perf NMIs during the kexec, 
> and a *long* wild goose chase based on that false assumption...

:-/

> Once I'd written the debug code, I just wanted to clean it up a bit 
> and push it out for the benefit of others; that *was* the main point 
> of this series. All the rest of the cleanups are just yak shaving.
> 
> The realisation that we never even explicitly mapped the control code 
> page and always just got lucky because it happened to be in the same 
> 2MiB or 1GiB superpage as something else that we did map... was just 
> a bonus :)

I'm amazed and horrified in equal measure ;-)

> (That one is fixed in v3 which I'll post shortly, and is already in 
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kexec-debug
> )
> 
> > I'd argue that this debugging code probably needs a default-off Kconfig 
> > option, even with the obvious hard-coded environmental limitations & 
> > assumptions it has. Could be useful for very early debugging & would 
> > preserve your effort without it bitrotting too obviously.
> 
> Yeah. In v3 I've made it a config option, and made it use the 
> early_printk serial console (as long as that's an I/O based 8250; we 
> can add others too later).

That's lovely!

Thanks,

	Ingo

Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG

Posted by David Woodhouse 1 year, 2 months ago

On Mon, 2024-11-25 at 21:34 +0100, Ingo Molnar wrote:
>  
> > The realisation that we never even explicitly mapped the control code 
> > page and always just got lucky because it happened to be in the same 
> > 2MiB or 1GiB superpage as something else that we did map... was just 
> > a bonus :)
> 
> I'm amazed and horrified in equal measure ;-)

:)

The rest of today was dedicated to finding out that that isn't entirely
true. Mapping the control page explicitly was only helping because it
forced 2MiB mappings instead of a 1GiB mapping, and masked the fact
that PTI was causing the identmap code to scribble off the end of the
root PGD page...

It all just worked by pure fluke on x86_64 before, because x86_64 would
allocate an 8KiB control region and use the first half of it for the
PGD, and *then* copy the trampoline code into the second half, after
the identmap code had finished scribbling on it. So when I cleaned that
up to allocate the PGD separately and explicitly like i386 does, that's
why it exploded; not just due to allocation patterns.
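
The ordering effect described above can be modelled in user space. The sketch below is a simplified illustration, not kernel code: model_set_pgd() is a hypothetical stand-in which assumes that a PTI-aware PGD write also stores a mirror copy one page further on, and the "neighbour" page in the second function is placed after the PGD purely for demonstration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	4096u
#define PTRS_PER_PGD	512u

/* Assumed model: the entry is written, plus a mirror in the next page. */
static void model_set_pgd(uint64_t *pgd, unsigned int idx, uint64_t val)
{
	assert(idx < PTRS_PER_PGD);
	pgd[idx] = val;
	pgd[idx + PTRS_PER_PGD] = val;	/* lands in the page after the PGD */
}

/*
 * Old layout: one 8KiB region, PGD in the first page, trampoline code
 * copied into the second page only after the page tables were built.
 */
static bool old_layout_code_intact(void)
{
	static uint64_t region[2 * PTRS_PER_PGD];
	uint8_t code[PAGE_SIZE];

	memset(region, 0, sizeof(region));
	memset(code, 0x90, sizeof(code));
	model_set_pgd(region, 0, UINT64_C(0x1063));
	/* The later copy overwrites the stray mirror entry. */
	memcpy(&region[PTRS_PER_PGD], code, sizeof(code));
	return memcmp(&region[PTRS_PER_PGD], code, sizeof(code)) == 0;
}

/*
 * New layout: the PGD is its own page, so the stray write hits whatever
 * follows it, and nothing repairs the damage afterwards.
 */
static bool new_layout_neighbour_intact(void)
{
	static uint64_t pages[2 * PTRS_PER_PGD];
	uint8_t expect[PAGE_SIZE];

	memset(expect, 0x90, sizeof(expect));
	memcpy(&pages[PTRS_PER_PGD], expect, sizeof(expect));
	memset(pages, 0, PAGE_SIZE);
	model_set_pgd(pages, 0, UINT64_C(0x1063));
	return memcmp(&pages[PTRS_PER_PGD], expect, sizeof(expect)) == 0;
}
```

In this model old_layout_code_intact() returns true and new_layout_neighbour_intact() returns false: the same stray write is harmless when the second page is filled in afterwards, and destructive when it is not.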

Still, I think I have a handle on fairly much everything that's broken,
except the occasional warning on the way back from
KEXEC_PRESERVE_CONTEXT thus:

[    1.423464] ------------[ cut here ]------------
[    1.423950] Interrupts enabled after irqrouter_resume+0x0/0x50
[    1.424605] WARNING: CPU: 0 PID: 215 at drivers/base/syscore.c:103 syscore_resume+0x152/0x180
[    1.425467] Modules linked in:
[    1.425791] CPU: 0 UID: 0 PID: 215 Comm: kexec Not tainted 6.12.0-rc5+ #2015
[    1.426498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.427628] RIP: 0010:syscore_resume+0x152/0x180
[    1.428101] Code: 00 e9 e1 fe ff ff 80 3d b1 b8 c4 01 00 0f 85 21 ff ff ff 48 8b 73 18 48 c7 c7 32 b8 b6 ac c6 05 99 b8 c4 01 01 e8 9e 3f 55 ff <0f> 0b e9 03 ff ff ff 80 3d 87 b8 c4 01 00 0f 85 b8 fe ff ff 48 c7
[    1.429913] RSP: 0018:ffffae9bc03bfd00 EFLAGS: 00010282
[    1.430445] RAX: 0000000000000000 RBX: ffffffffad6fbb20 RCX: ffffffffad5636a8
[    1.431153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[    1.431869] RBP: 0000000028121969 R08: 0000000000000000 R09: 0000000000000000
[    1.432594] R10: ffffae9bc03bfaa8 R11: 7075727265746e49 R12: ffffae9bc03bfd28
[    1.433313] R13: ffffffffad471f60 R14: 00000000fee1dead R15: 0000000000000000
[    1.434021] FS:  00007f77d4a45740(0000) GS:ffff91d0fd600000(0000) knlGS:0000000000000000
[    1.434815] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.435385] CR2: 00007f7e011e7070 CR3: 00000000012fe001 CR4: 0000000000170ef0
[    1.436073] Call Trace:
[    1.436334]  <TASK>
[    1.436558]  ? syscore_resume+0x152/0x180
[    1.436956]  ? __warn.cold+0x93/0xfa
[    1.437319]  ? syscore_resume+0x152/0x180
[    1.437717]  ? report_bug+0xff/0x140
[    1.438075]  ? handle_bug+0x58/0x90
[    1.438438]  ? exc_invalid_op+0x17/0x70
[    1.438826]  ? asm_exc_invalid_op+0x1a/0x20
[    1.439246]  ? syscore_resume+0x152/0x180
[    1.439644]  kernel_kexec+0x10a/0x160
[    1.440010]  __do_sys_reboot+0x1fd/0x240
[    1.440485]  do_syscall_64+0x82/0x160
[    1.440863]  ? syscall_exit_to_user_mode+0x10/0x210
[    1.441351]  ? do_syscall_64+0x8e/0x160
[    1.441735]  ? exc_page_fault+0x7e/0x180
[    1.442123]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    1.442623] RIP: 0033:0x7f77d4b5adb7
[    1.442992] Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 49 50 0c 00 f7 d8 64 89 02 b8
[    1.444757] RSP: 002b:00007ffc56bc30f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[    1.445493] RAX: ffffffffffffffda RBX: 00007ffc56bc3260 RCX: 00007f77d4b5adb7
[    1.446173] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[    1.446848] RBP: 00007ffc56bc32c0 R08: 000055e1cef3e010 R09: 0000000000000007
[    1.447527] R10: 000055e1cef41020 R11: 0000000000000246 R12: 0000000000000001
[    1.448219] R13: 000055e19046b896 R14: 000055e1cef3e4a0 R15: 0000000000000000
[    1.448893]  </TASK>
[    1.449119] ---[ end trace 0000000000000000 ]---
[    1.452539] Enabling non-boot CPUs ...
[    1.452935] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[    1.453678] smpboot: Booting Node 0 Processor 1 APIC 0x1
[    1.455531] CPU1 is up
[    1.460031] virtio_blk virtio1: 2/0/0 default/read/poll queues
[    1.465246] OOM killer enabled.
[    1.465580] Restarting tasks ... done.