[PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware

Andrew Cooper posted 1 patch 2 years, 5 months ago
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20211028232658.20637-1-andrew.cooper3@citrix.com
[PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Andrew Cooper 2 years, 5 months ago
The `ljmp *mem` instruction is (famously?) not binary compatible between Intel
and AMD CPUs.  The AMD-compatible version would require .long to be .quad in
the second hunk.

Switch to using lretq, which is compatible between Intel and AMD, as well as
being less logic overall.

Fixes: 5a82d5cf352d ("kexec: extend hypercall with improved load/unload ops")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Ian Jackson <iwj@xenproject.org>

For 4.16.  This is a bugfix for a rare (so rare it has probably never been
exercised) but plain-broken use case.

One argument against taking it says that this has been broken for 8 years
already, so what's a few extra weeks.  Another is that this patch is only
compile tested because I don't have a suitable setup to repro, nor the time to
try organising one.

On the other hand, I specifically used the point of binary incompatibility to
persuade Intel to drop Call Gates out of the architecture in the forthcoming
FRED spec.

The lretq pattern used here matches x86_32_switch() in
xen/arch/x86/boot/head.S, and this codepath is executed on every MB2+EFI
xen.gz boot, which from XenServer alone is a very wide set of testing.
---
 xen/arch/x86/x86_64/kexec_reloc.S | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
index d488d127cfb9..a93f92b19248 100644
--- a/xen/arch/x86/x86_64/kexec_reloc.S
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -86,12 +86,11 @@ call_32_bit:
         movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
         lgdt    compat_mode_gdt_desc(%rip)
 
-        /* Relocate compatibility mode entry point address. */
-        leal    compatibility_mode(%rip), %eax
-        movl    %eax, compatibility_mode_far(%rip)
-
         /* Enter compatibility mode. */
-        ljmp    *compatibility_mode_far(%rip)
+        lea     compatibility_mode(%rip), %rax
+        push    $0x10
+        push    %rax
+        lretq
 
 relocate_pages:
         /* %rdi - indirection page maddr */
@@ -171,10 +170,6 @@ compatibility_mode:
         ud2
 
         .align 4
-compatibility_mode_far:
-        .long 0x00000000             /* set in call_32_bit above */
-        .word 0x0010
-
 compat_mode_gdt_desc:
         .word .Lcompat_mode_gdt_end - compat_mode_gdt -1
         .quad 0x0000000000000000     /* set in call_32_bit above */
-- 
2.11.0
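
For readers unfamiliar with the far-return idiom, the following is a minimal
sketch in C of what the new push/push/lretq sequence does to the stack.  It is
a host-side model only, not Xen code: struct cpu, push64(), pop64(), lretq()
and enter_compat_mode() are hypothetical names invented for illustration.  The
model assumes that in 64-bit mode each push occupies one 8-byte slot, and that
lretq pops the new instruction pointer first and the new code selector second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the state touched by the sequence; not Xen code. */
struct cpu {
    uint64_t stack[8];   /* stack of 64-bit slots, growing downwards */
    unsigned int sp;     /* index of the top-of-stack slot; 8 means empty */
    uint64_t rip;        /* instruction pointer */
    uint16_t cs;         /* code segment selector */
};

/* A 64-bit push: move down one slot, then store the value. */
static void push64(struct cpu *c, uint64_t v)
{
    c->stack[--c->sp] = v;
}

/* A 64-bit pop: load the value, then move up one slot. */
static uint64_t pop64(struct cpu *c)
{
    return c->stack[c->sp++];
}

/* lretq: pop the new instruction pointer, then the new code selector. */
static void lretq(struct cpu *c)
{
    c->rip = pop64(c);
    c->cs = (uint16_t)pop64(c);  /* only the low 16 bits of the slot are used */
}

/* The sequence from the patch: push selector, push target, far return. */
static void enter_compat_mode(struct cpu *c, uint64_t target)
{
    push64(c, 0x10);     /* selector 0x10, as in the removed .word 0x0010 */
    push64(c, target);   /* address of compatibility_mode */
    lretq(c);
}
```

Starting from an empty stack (sp == 8), enter_compat_mode(&c, 0x1000) leaves
rip == 0x1000, cs == 0x10 and the stack empty again.  Because the selector and
target travel through the stack, no in-memory far pointer is involved, which
is why the compatibility_mode_far data can be dropped.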


Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Ian Jackson 2 years, 4 months ago
Andrew Cooper writes ("[PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware"):
> The `ljmp *mem` instruction is (famously?) not binary compatible between Intel
> and AMD CPUs.  The AMD-compatible version would require .long to be .quad in
> the second hunk.
> 
> Switch to using lretq, which is compatible between Intel and AMD, as well as
> being less logic overall.
> 
> Fixes: 5a82d5cf352d ("kexec: extend hypercall with improved load/unload ops")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Ian Jackson <iwj@xenproject.org>
> 
> For 4.16.  This is a bugfix for a rare (so rare it has probably never been
> exercised) but plain-broken use case.
> 
> One argument against taking it says that this has been broken for 8 years
> already, so what's a few extra weeks.  Another is that this patch is only
> compile tested because I don't have a suitable setup to repro, nor the time to
> try organising one.

Thanks for being frank about testing.

The bug is a ?race? ?  Which hardly ever happens ?  Or it only affects
some strange configurations ?  Or ... ?

> On the other hand, I specifically used the point of binary incompatibility to
> persuade Intel to drop Call Gates out of the architecture in the forthcoming
> FRED spec.

I'm afraid I can't make head or tail of this.  What are the
implications ?

> The lretq pattern used here matches x86_32_switch() in
> xen/arch/x86/boot/head.S, and this codepath is executed on every MB2+EFI
> xen.gz boot, which from XenServer alone is a very wide set of testing.

AIUI this is an argument saying that the basic principle of this
change is good.  Good.

However: is there some risk of a non-catastrophic breakage here, for
example, if there was a slip in the actual implementation ?
(Catastrophic breakage would break all our tests, I think.)

Thanks,
Ian.

Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Andrew Cooper 2 years, 4 months ago
On 01/11/2021 10:53, Ian Jackson wrote:
> Andrew Cooper writes ("[PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware"):
>> The `ljmp *mem` instruction is (famously?) not binary compatible between Intel
>> and AMD CPUs.  The AMD-compatible version would require .long to be .quad in
>> the second hunk.
>>
>> Switch to using lretq, which is compatible between Intel and AMD, as well as
>> being less logic overall.
>>
>> Fixes: 5a82d5cf352d ("kexec: extend hypercall with improved load/unload ops")
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Wei Liu <wl@xen.org>
>> CC: Ian Jackson <iwj@xenproject.org>
>>
>> For 4.16.  This is a bugfix for a rare (so rare it has probably never been
>> exercised) but plain-broken use case.
>>
>> One argument against taking it says that this has been broken for 8 years
>> already, so what's a few extra weeks.  Another is that this patch is only
>> compile tested because I don't have a suitable setup to repro, nor the time to
>> try organising one.
> Thanks for being frank about testing.
>
> The bug is a ?race? ?  Which hardly ever happens ?  Or it only affects
> some strange configurations ?  Or ... ?

Strange configuration.

On AMD hardware, if you try to use a 32bit crash kernel, then Xen will
unconditionally crash when trying to transition to it.

Any other scenario (Intel hardware, or a 64bit crash kernel) will work
fine and without incident.

>> On the other hand, I specifically used the point of binary incompatibility to
>> persuade Intel to drop Call Gates out of the architecture in the forthcoming
>> FRED spec.
> I'm afraid I can't make head or tail of this.  What are the
> implications ?

I managed to get some CPU architects to agree that there was a binary
incompatibility here.

>> The lretq pattern used here matches x86_32_switch() in
>> xen/arch/x86/boot/head.S, and this codepath is executed on every MB2+EFI
>> xen.gz boot, which from XenServer alone is a very wide set of testing.
> AIUI this is an argument saying that the basic principle of this
> change is good.  Good.
>
> However: is there some risk of a non-catastrophic breakage here, for
> example, if there was a slip in the actual implementation ?
> (Catastrophic breakage would break all our tests, I think.)

This path is only taken for a 32bit crash kernel.  It is not taken for
64bit crash kernels, or they wouldn't work on AMD either, and this is
something we test routinely in XenServer.

The worst that can happen is that I've messed the lretq pattern up, and
broken transition to all 32bit crash kernels, irrespective of hardware
vendor.

It will either function correctly, or explode.  If it is broken, it
won't be subtle, or dependent on the phase of the moon/etc.

~Andrew


Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Ian Jackson 2 years, 4 months ago
Andrew Cooper writes ("Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware"):
> This path is only taken for a 32bit crash kernel.  It is not taken for
> 64bit crash kernels, or they wouldn't work on AMD either, and this is
> something we test routinely in XenServer.
> 
> The worst that can happen is that I've messed the lretq pattern up, and
> broken transition to all 32bit crash kernels, irrespective of hardware
> vendor.
> 
> It will either function correctly, or explode.  If it is broken, it
> won't be subtle, or dependent on the phase of the moon/etc.

Thanks for this confirmation.

Release-Acked-by: Ian Jackson <iwj@xenproject.org>

(NB I'm still working on RC1 so commit moratorium still in force)

Ian.

Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Andrew Cooper 2 years, 4 months ago
On 01/11/2021 12:13, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware"):
>> This path is only taken for a 32bit crash kernel.  It is not taken for
>> 64bit crash kernels, or they wouldn't work on AMD either, and this is
>> something we test routinely in XenServer.
>>
>> The worst that can happen is that I've messed the lretq pattern up, and
>> broken transition to all 32bit crash kernels, irrespective of hardware
>> vendor.
>>
>> It will either function correctly, or explode.  If it is broken, it
>> won't be subtle, or dependent on the phase of the moon/etc.
> Thanks for this confirmation.
>
> Release-Acked-by: Ian Jackson <iwj@xenproject.org>

Thanks.

Unfortunately, I've made a blunder here.  The code as implemented is
broken on Intel, and works on AMD.  (I.e. I need to swap Intel and AMD
in the commit message).  Have done locally, but won't repost just for that.

~Andrew

Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Ian Jackson 2 years, 4 months ago
Andrew Cooper writes ("Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware"):
> On 01/11/2021 12:13, Ian Jackson wrote:
> > Andrew Cooper writes ("Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware"):
> >> This path is only taken for a 32bit crash kernel.  It is not taken for
> >> 64bit crash kernels, or they wouldn't work on AMD either, and this is
> >> something we test routinely in XenServer.
> >>
> >> The worst that can happen is that I've messed the lretq pattern up, and
> >> broken transition to all 32bit crash kernels, irrespective of hardware
> >> vendor.
> >>
> >> It will either function correctly, or explode.  If it is broken, it
> >> won't be subtle, or dependent on the phase of the moon/etc.
> > Thanks for this confirmation.
> >
> > Release-Acked-by: Ian Jackson <iwj@xenproject.org>
> 
> Thanks.
> 
> Unfortunately, I've made a blunder here.  The code as implemented is
> broken on Intel, and works on AMD.  (I.e. I need to swap Intel and AMD
> in the commit message).  Have done locally, but won't repost just for that.

OK, thanks.

Ian.

Re: [PATCH] x86/kexec: Fix crash on transition to a 32bit kernel on AMD hardware
Posted by Jan Beulich 2 years, 4 months ago
On 29.10.2021 01:26, Andrew Cooper wrote:
> The `ljmp *mem` instruction is (famously?) not binary compatible between Intel
> and AMD CPUs.  The AMD-compatible version would require .long to be .quad in
> the second hunk.

From all sources I have, the incompatibility is only with REX.W: Intel
honors it (allowing a mem64:16 operand), while AMD ignores it (using
the same mem32:16 operand form as without REX.W). All the same as for
L{F,G,S}S. Hence I do not see why the present form of (32-bit) LJMP
would be a problem anywhere.

> Switch to using lretq, which is compatible between Intel and AMD, as well as
> being less logic overall.

I certainly don't mind the switch to LRETQ, but then the reasoning
will imo need to change.

Jan
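
The two memory operand forms mentioned above can be modelled in C to show why
the same bytes decode differently depending on which form the CPU uses.  This
is a hedged sketch, not Xen code: struct far_ptr, load_le(), decode_m16_32(),
decode_m16_64() and sample_operand are hypothetical names, and the sample
bytes (offset 0x1000, selector 0x0010, then four unrelated bytes) are made up
for illustration.  It assumes the usual little-endian layout with the offset
stored first and the selector immediately after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded far pointer; not a Xen type. */
struct far_ptr {
    uint64_t offset;     /* target instruction pointer */
    uint16_t selector;   /* target code segment selector */
};

/* Read len bytes as a little-endian integer, independent of host endianness. */
static uint64_t load_le(const uint8_t *mem, size_t len)
{
    uint64_t v = 0;

    for (size_t i = 0; i < len; i++)
        v |= (uint64_t)mem[i] << (8 * i);

    return v;
}

/* 32-bit form: 4-byte offset, then 2-byte selector (6 bytes read). */
static struct far_ptr decode_m16_32(const uint8_t *mem)
{
    struct far_ptr p = { load_le(mem, 4), (uint16_t)load_le(mem + 4, 2) };

    return p;
}

/* 64-bit form: 8-byte offset, then 2-byte selector (10 bytes read). */
static struct far_ptr decode_m16_64(const uint8_t *mem)
{
    struct far_ptr p = { load_le(mem, 8), (uint16_t)load_le(mem + 8, 2) };

    return p;
}

/* .long 0x00001000 ; .word 0x0010, followed by four unrelated bytes. */
static const uint8_t sample_operand[10] = {
    0x00, 0x10, 0x00, 0x00,   /* 32-bit offset 0x00001000 */
    0x10, 0x00,               /* selector 0x0010 */
    0xaa, 0xbb, 0xcc, 0xdd,   /* whatever happens to follow in memory */
};
```

Decoding sample_operand with decode_m16_32() gives offset 0x1000 and selector
0x0010, as intended.  Decoding the same bytes with decode_m16_64() folds the
selector and the two following bytes into the offset (0xbbaa001000001000) and
takes the selector from the last two bytes (0xddcc).  Data laid out for one
form is therefore only safe on a CPU that decodes the instruction using that
same form.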