From: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/relocate_kernel_64.S | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 67f6853c7abe..ebbd76c9a3e9 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -14,6 +14,8 @@
 #include <asm/nospec-branch.h>
 #include <asm/unwind_hints.h>
 
+#define DEBUG
+
 /*
  * Must be relocatable PIC code callable as a C function, in particular
  * there must be a plain RET and not jump to return thunk.
@@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	pushw	$0xff
 	lidt	(%rsp)
 	addq	$10, %rsp
+
+	int3
 #endif /* DEBUG */
 /*
--
2.47.0

* David Woodhouse <dwmw2@infradead.org> wrote:

> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kernel/relocate_kernel_64.S | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 67f6853c7abe..ebbd76c9a3e9 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -14,6 +14,8 @@
>  #include <asm/nospec-branch.h>
>  #include <asm/unwind_hints.h>
>
> +#define DEBUG
> +
>  /*
>   * Must be relocatable PIC code callable as a C function, in particular
>   * there must be a plain RET and not jump to return thunk.
> @@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	pushw	$0xff
>  	lidt	(%rsp)
>  	addq	$10, %rsp
> +
> +	int3
>  #endif /* DEBUG */

That's a really nice piece of debugging code written in assembly,
combined with the exception handling feature that generates debug
output to begin with. Epic effort. :-)

Just curious: did you write this code to debug the series, or was there
some original hair-tearing regression that motivated you? Is there an
upstream fix to marvel at and be horrified about in equal measure?

I'd argue that this debugging code probably needs a default-off Kconfig
option, even with the obvious hard-coded environmental limitations &
assumptions it has. Could be useful to very early debugging & would
preserve your effort without it bitrotting too obviously.

Thanks,

	Ingo

On Mon, 2024-11-25 at 10:21 +0100, Ingo Molnar wrote:
>
> * David Woodhouse <dwmw2@infradead.org> wrote:
>
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  arch/x86/kernel/relocate_kernel_64.S | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 67f6853c7abe..ebbd76c9a3e9 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -14,6 +14,8 @@
> >  #include <asm/nospec-branch.h>
> >  #include <asm/unwind_hints.h>
> >
> > +#define DEBUG
> > +
> >  /*
> >   * Must be relocatable PIC code callable as a C function, in particular
> >   * there must be a plain RET and not jump to return thunk.
> > @@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	pushw	$0xff
> >  	lidt	(%rsp)
> >  	addq	$10, %rsp
> > +
> > +	int3
> >  #endif /* DEBUG */
>
> That's a really nice piece of debugging code written in assembly,
> combined with the exception handling feature that generates debug
> output to begin with. Epic effort. :-)

Thanks :)

> Just curious: did you write this code to debug the series, or was there
> some original hair-tearing regression that motivated you? Is there's an
> upstream fix to marvel at and be horrified about in equal measure?

https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/t/#u
is the upstream fix.

It's all the more horrifying because it was already *fixed* upstream
before I lost weeks of my life to chasing it. And the trigger which
actually made it *happen*, and made our production systems allocate
memory within that dangerous 1MiB region adjacent to the RMP table, was
a tweak to the NMI watchdog period... leading to an assumption that we
were getting stray perf NMIs during the kexec, and a *long* wild goose
chase based on that false assumption...

Once I'd written the debug code, I just wanted to clean it up a bit and
push it out for the benefit of others; that *was* the main point of
this series. All the rest of the cleanups are just yak shaving.

The realisation that we never even explicitly mapped the control code
page and always just got lucky because it happened to be in the same
2MiB or 1GiB superpage as something else that we did map... was just a
bonus :)

(That one is fixed in v3 which I'll post shortly, and is already in
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kexec-debug
)

> I'd argue that this debugging code probably needs a default-off Kconfig
> option, even with the obvious hard-coded environmental limitations &
> assumptions it has. Could be useful to very early debugging & would
> preserve your effort without it bitrotting too obviously.

Yeah. In v3 I've made it a config option, and made it use the
early_printk serial console (as long as that's an I/O based 8250; we
can add others too later).

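To illustrate the "got lucky" superpage effect mentioned in this message, here is a minimal sketch, not the kernel's identmap code. It only shows the arithmetic: a page that was never mapped explicitly is still reachable if it falls inside a 2MiB or 1GiB superpage that was mapped for some other address. The function names and the example physical addresses are hypothetical and chosen purely for illustration.

```python
# Sketch only: models superpage coverage arithmetic, not real page tables.
SZ_2M = 2 << 20   # 2MiB superpage
SZ_1G = 1 << 30   # 1GiB superpage


def superpage_base(addr: int, size: int) -> int:
    """Round addr down to the start of the superpage that contains it."""
    return addr & ~(size - 1)


def covered_by_accident(unmapped: int, mapped: int, size: int) -> bool:
    """True if 'unmapped' shares a superpage with an explicitly mapped address."""
    return superpage_base(unmapped, size) == superpage_base(mapped, size)


# Hypothetical physical addresses (assumptions, not taken from the thread).
control_page = 0x7F5FE000   # page that was never mapped explicitly
near_mapping = 0x7F400000   # something mapped within the same 2MiB region
far_mapping = 0x7F800000    # something mapped in a different 2MiB region

print(covered_by_accident(control_page, near_mapping, SZ_2M))  # same 2MiB page
print(covered_by_accident(control_page, far_mapping, SZ_2M))   # different 2MiB page
print(covered_by_accident(control_page, far_mapping, SZ_1G))   # but same 1GiB page
```

With 1GiB mappings the accidental coverage is much wider than with 2MiB mappings, which is one way a change in mapping granularity can expose or hide a missing explicit mapping.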
* David Woodhouse <dwmw2@infradead.org> wrote:
> > Just curious: did you write this code to debug the series, or was
> > there some original hair-tearing regression that motivated you? Is
> > there an upstream fix to marvel at and be horrified about in
> > equal measure?
>
> https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/t/#u
> is the upstream fix.
Which ended up being the following upstream commit:
88a921aa3c6b ("x86/sev: Ensure that RMP table fixups are reserved")
Might make sense to add this commit reference to one of the central
patches of the GDT/IDT code, to document how this feature is able to
pin down very hard to debug regressions. (Even if the upstream fix was
done independently in probably luckier circumstances.)
> [...] It's all the more horrifying because it was already *fixed*
> upstream before I lost weeks of my life to chasing it. And the
> trigger which actually made it *happen*, and made our production
> systems allocate memory within that dangerous 1MiB region adjacent to
> the RMP table, was a tweak to the NMI watchdog period... leading to
> an assumption that we were getting stray perf NMIs during the kexec,
> and a *long* wild goose chase based on that false assumption...
:-/
> Once I'd written the debug code, I just wanted to clean it up a bit
> and push it out for the benefit of others; that *was* the main point
> of this series. All the rest of the cleanups are just yak shaving.
>
> The realisation that we never even explicitly mapped the control code
> page and always just got lucky because it happened to be in the same
> 2MiB or 1GiB superpage as something else that we did map... was just
> a bonus :)
I'm amazed and horrified in equal measure ;-)
> (That one is fixed in v3 which I'll post shortly, and is already in
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kexec-debug
> )
>
> > I'd argue that this debugging code probably needs a default-off Kconfig
> > option, even with the obvious hard-coded environmental limitations &
> > assumptions it has. Could be useful to very early debugging & would
> > preserve your effort without it bitrotting too obviously.
>
> Yeah. In v3 I've made it a config option, and made it use the
> early_printk serial console (as long as that's an I/O based 8250; we
> can add others too later).
That's lovely!
Thanks,
Ingo

On Mon, 2024-11-25 at 21:34 +0100, Ingo Molnar wrote:
>
> > The realisation that we never even explicitly mapped the control code
> > page and always just got lucky because it happened to be in the same
> > 2MiB or 1GiB superpage as something else that we did map... was just
> > a bonus :)
>
> I'm amazed and horrified in equal measure ;-)

:)

The rest of today was dedicated to finding out that that isn't entirely
true. Mapping the control page explicitly was only helping because it
forced 2MiB mappings instead of a 1GiB mapping, and masked the fact
that PTI was causing the identmap code to scribble off the end of the
root PGD page...

It all just worked by pure fluke on x86_64 before, because x86_64 would
allocate an 8KiB control region and use the first half of it for the
PGD, and *then* copy the trampoline code into the second half, after
the identmap code had finished scribbling on it. So when I cleaned that
up to allocate the PGD separately and explicitly like i386 does, that's
why it exploded; not just due to allocation patterns.

Still, I think I have a handle on fairly much everything that's broken,
except the occasional warning on the way back from
KEXEC_PRESERVE_CONTEXT thus:

[    1.423464] ------------[ cut here ]------------
[    1.423950] Interrupts enabled after irqrouter_resume+0x0/0x50
[    1.424605] WARNING: CPU: 0 PID: 215 at drivers/base/syscore.c:103 syscore_resume+0x152/0x180
[    1.425467] Modules linked in:
[    1.425791] CPU: 0 UID: 0 PID: 215 Comm: kexec Not tainted 6.12.0-rc5+ #2015
[    1.426498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.427628] RIP: 0010:syscore_resume+0x152/0x180
[    1.428101] Code: 00 e9 e1 fe ff ff 80 3d b1 b8 c4 01 00 0f 85 21 ff ff ff 48 8b 73 18 48 c7 c7 32 b8 b6 ac c6 05 99 b8 c4 01 01 e8 9e 3f 55 ff <0f> 0b e9 03 ff ff ff 80 3d 87 b8 c4 01 00 0f 85 b8 fe ff ff 48 c7
[    1.429913] RSP: 0018:ffffae9bc03bfd00 EFLAGS: 00010282
[    1.430445] RAX: 0000000000000000 RBX: ffffffffad6fbb20 RCX: ffffffffad5636a8
[    1.431153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[    1.431869] RBP: 0000000028121969 R08: 0000000000000000 R09: 0000000000000000
[    1.432594] R10: ffffae9bc03bfaa8 R11: 7075727265746e49 R12: ffffae9bc03bfd28
[    1.433313] R13: ffffffffad471f60 R14: 00000000fee1dead R15: 0000000000000000
[    1.434021] FS:  00007f77d4a45740(0000) GS:ffff91d0fd600000(0000) knlGS:0000000000000000
[    1.434815] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.435385] CR2: 00007f7e011e7070 CR3: 00000000012fe001 CR4: 0000000000170ef0
[    1.436073] Call Trace:
[    1.436334]  <TASK>
[    1.436558]  ? syscore_resume+0x152/0x180
[    1.436956]  ? __warn.cold+0x93/0xfa
[    1.437319]  ? syscore_resume+0x152/0x180
[    1.437717]  ? report_bug+0xff/0x140
[    1.438075]  ? handle_bug+0x58/0x90
[    1.438438]  ? exc_invalid_op+0x17/0x70
[    1.438826]  ? asm_exc_invalid_op+0x1a/0x20
[    1.439246]  ? syscore_resume+0x152/0x180
[    1.439644]  kernel_kexec+0x10a/0x160
[    1.440010]  __do_sys_reboot+0x1fd/0x240
[    1.440485]  do_syscall_64+0x82/0x160
[    1.440863]  ? syscall_exit_to_user_mode+0x10/0x210
[    1.441351]  ? do_syscall_64+0x8e/0x160
[    1.441735]  ? exc_page_fault+0x7e/0x180
[    1.442123]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    1.442623] RIP: 0033:0x7f77d4b5adb7
[    1.442992] Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 49 50 0c 00 f7 d8 64 89 02 b8
[    1.444757] RSP: 002b:00007ffc56bc30f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[    1.445493] RAX: ffffffffffffffda RBX: 00007ffc56bc3260 RCX: 00007f77d4b5adb7
[    1.446173] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[    1.446848] RBP: 00007ffc56bc32c0 R08: 000055e1cef3e010 R09: 0000000000000007
[    1.447527] R10: 000055e1cef41020 R11: 0000000000000246 R12: 0000000000000001
[    1.448219] R13: 000055e19046b896 R14: 000055e1cef3e4a0 R15: 0000000000000000
[    1.448893]  </TASK>
[    1.449119] ---[ end trace 0000000000000000 ]---
[    1.452539] Enabling non-boot CPUs ...
[    1.452935] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[    1.453678] smpboot: Booting Node 0 Processor 1 APIC 0x1
[    1.455531] CPU1 is up
[    1.460031] virtio_blk virtio1: 2/0/0 default/read/poll queues
[    1.465246] OOM killer enabled.
[    1.465580] Restarting tasks ... done.
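
As a reading aid for the trace above, the userspace register dump can be decoded to confirm that the warning fires on the return path of a reboot(2) call with the kexec command. This is a small sketch, not part of the series: the constants are the reboot magic numbers and command value from the Linux UAPI `linux/reboot.h` header and the x86_64 syscall number for reboot, while the helper name `decode_reboot_call` is hypothetical.

```python
import struct

# Values from the Linux UAPI header <linux/reboot.h> and the x86_64 syscall table.
LINUX_REBOOT_MAGIC1 = 0xFEE1DEAD
LINUX_REBOOT_MAGIC2 = 672274793          # 0x28121969
LINUX_REBOOT_CMD_KEXEC = 0x45584543
NR_REBOOT_X86_64 = 169                   # 0xa9


def decode_reboot_call(orig_rax: int, rdi: int, rsi: int, rdx: int) -> dict:
    """Hypothetical helper: interpret syscall registers as a reboot(2) call."""
    return {
        "is_reboot": orig_rax == NR_REBOOT_X86_64,
        "magic_ok": rdi == LINUX_REBOOT_MAGIC1 and rsi == LINUX_REBOOT_MAGIC2,
        "is_kexec": rdx == LINUX_REBOOT_CMD_KEXEC,
        # The command value is four ASCII bytes when read most significant first.
        "cmd_ascii": struct.pack(">I", rdx).decode("ascii", errors="replace"),
    }


# Register values copied from the userspace part of the trace above.
result = decode_reboot_call(
    orig_rax=0x00000000000000A9,
    rdi=0x00000000FEE1DEAD,
    rsi=0x0000000028121969,
    rdx=0x0000000045584543,
)
print(result)
```

The same magic values also show up in the kernel-side register dump (RBP and R14), consistent with `__do_sys_reboot` calling `kernel_kexec` in the call trace.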