The original shadow stack support has an error on S3 resume with very bizarre
fallout. The BSP comes back up, but APs fail with:
(XEN) Enabling non-boot CPUs ...
(XEN) Stuck ??
(XEN) Error bringing CPU1 up: -5
and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:
(XEN) d1v0 Unexpected vmexit: reason 3
(XEN) domain_crash called from vmx.c:4304
(XEN) Domain 1 (vcpu#0) crashed on cpu#0:
The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU, and will be addressed in a subsequent patch. It is a
consequence of the APs triple faulting.
The APs triple fault because we don't tear down the stacks on suspend. The
idle/play_dead loop is killed in the middle of running, meaning that the
supervisor token is left busy.
On resume, SETSSBSY finds the token already busy, suffers #CP and triple
faults because the IDT isn't configured this early.
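
As background, the supervisor token is the 8-byte value at the address held in
MSR_PL0_SSP, and its value is that same address with bit 0 serving as the busy
bit.  A minimal C sketch of the check SETSSBSY performs (illustrative only;
setssbsy_check() is an invented name, not Xen or hardware code):

    #include <stdbool.h>
    #include <stdint.h>

    #define SSTK_TOKEN_BUSY 1UL

    /*
     * Mimic SETSSBSY's validation: the token must contain its own address
     * with the busy bit clear.  A token left busy by an interrupted
     * idle/play_dead loop fails the check, which on hardware raises #CP.
     */
    static bool setssbsy_check(uint64_t *token, uint64_t pl0_ssp)
    {
        if ( *token != pl0_ssp )            /* busy, or wrong address */
            return false;                   /* #CP on real hardware */

        *token = pl0_ssp | SSTK_TOKEN_BUSY; /* mark busy; SSP := pl0_ssp */
        return true;
    }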
Rework the AP bringup path to (re)create the supervisor token. This ensures
the primary stack is non-busy before use.
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: https://github.com/QubesOS/qubes-issues/issues/7283
Reported-by: Thiner Logoer <logoerthiner1@163.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Thiner Logoer <logoerthiner1@163.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Thiner Logoer <logoerthiner1@163.com>
CC: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Slightly RFC. This does fix the crash encountered, but it occurs to me that
there's a race condition when S3 platform powerdown is incident with an
NMI/#MC, where more than just the primary shadow stack can end up busy on
resume.
A larger fix would be to change how we allocate tokens, and always have each
CPU set up its own tokens. I didn't do this originally in the hopes of having
WRSSQ generally disabled, but that plan failed when encountering reality...
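
A hypothetical sketch of that direction, where each CPU rewrites its own
tokens during bringup (reset_shstk_token() is an invented name; WRSSQ is
required because shadow-stack mappings reject ordinary stores):

    /*
     * Rewrite a supervisor token in place so no stale busy bit from a
     * previous life survives.  Each CPU would call this for its primary
     * and IST shadow stacks while coming up.
     */
    static void reset_shstk_token(unsigned long addr)
    {
        asm volatile ( "wrssq %[val], %[ptr]"
                       : [ptr] "=m" (*(unsigned long *)addr)
                       : [val] "r" (addr) );
    }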
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index fa41990dde0f..5d12937a0e40 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -51,13 +51,21 @@ ENTRY(__high_start)
         test    $CET_SHSTK_EN, %al
         jz      .L_ap_cet_done
 
-        /* Derive MSR_PL0_SSP from %rsp (token written when stack is allocated). */
-        mov     $MSR_PL0_SSP, %ecx
+        /* Derive the supervisor token address from %rsp. */
         mov     %rsp, %rdx
+        and     $~(STACK_SIZE - 1), %rdx
+        or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
+
+        /*
+         * Write a new supervisor token.  Doesn't matter on boot, but for S3
+         * resume this clears the busy bit.
+         */
+        wrssq   %rdx, (%rdx)
+
+        /* Point MSR_PL0_SSP at the token. */
+        mov     $MSR_PL0_SSP, %ecx
+        mov     %edx, %eax
         shr     $32, %rdx
-        mov     %esp, %eax
-        and     $~(STACK_SIZE - 1), %eax
-        or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %eax
         wrmsr
 
         setssbsy
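
Roughly, in C terms, what the reworked path now does (illustration only; the
real code must run this early in assembly, stack_pointer() and write_shstk()
are placeholders, while wrmsrl(), MSR_PL0_SSP, STACK_SIZE, PRIMARY_SHSTK_SLOT
and PAGE_SIZE are the names the real code uses):

    static void recreate_supervisor_token(void)
    {
        unsigned long token =
            (stack_pointer() & ~(STACK_SIZE - 1)) |
            ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8);

        write_shstk(token, token);   /* WRSSQ: fresh token, busy bit clear */
        wrmsrl(MSR_PL0_SSP, token);  /* point the MSR at the token */
        asm volatile ( "setssbsy" ); /* validate token, set busy, load SSP */
    }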
On 24.02.2022 20:48, Andrew Cooper wrote:
> The original shadow stack support has an error on S3 resume with very bizarre
> fallout. The BSP comes back up, but APs fail with:
>
> (XEN) Enabling non-boot CPUs ...
> (XEN) Stuck ??
> (XEN) Error bringing CPU1 up: -5
>
> and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
> to be scheduled on the BSP dies with:
>
> (XEN) d1v0 Unexpected vmexit: reason 3
> (XEN) domain_crash called from vmx.c:4304
> (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
>
> The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
> scheduled vCPU, and will be addressed in a subsequent patch. It is a
> consequence of the APs triple faulting.
>
> The APs triple fault because we don't tear down the stacks on suspend. The
> idle/play_dead loop is killed in the middle of running, meaning that the
> supervisor token is left busy.
>
> On resume, SETSSBSY finds the token already busy, suffers #CP and triple
> faults because the IDT isn't configured this early.
>
> Rework the AP bringup path to (re)create the supervisor token. This ensures
> the primary stack is non-busy before use.
>
> Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
> Link: https://github.com/QubesOS/qubes-issues/issues/7283
> Reported-by: Thiner Logoer <logoerthiner1@163.com>
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Tested-by: Thiner Logoer <logoerthiner1@163.com>
> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> Slightly RFC. This does fix the crash encountered, but it occurs to me that
> there's a race condition when S3 platform powerdown is incident with an
> NMI/#MC, where more than just the primary shadow stack can end up busy on
> resume.
>
> A larger fix would be to change how we allocate tokens, and always have each
> CPU set up its own tokens. I didn't do this originally in the hopes of having
> WRSSQ generally disabled, but that plan failed when encountering reality...
While I think this wants fixing one way or another, I also think this
shouldn't block the immediate fix here (which addresses an unconditional
crash rather than a pretty unlikely one).
Jan
On 25/02/2022 08:38, Jan Beulich wrote:
> On 24.02.2022 20:48, Andrew Cooper wrote:
>> The original shadow stack support has an error on S3 resume with very bizarre
>> fallout. The BSP comes back up, but APs fail with:
>>
>> (XEN) Enabling non-boot CPUs ...
>> (XEN) Stuck ??
>> (XEN) Error bringing CPU1 up: -5
>>
>> and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
>> to be scheduled on the BSP dies with:
>>
>> (XEN) d1v0 Unexpected vmexit: reason 3
>> (XEN) domain_crash called from vmx.c:4304
>> (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
>>
>> The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
>> scheduled vCPU, and will be addressed in a subsequent patch. It is a
>> consequence of the APs triple faulting.
>>
>> The APs triple fault because we don't tear down the stacks on suspend. The
>> idle/play_dead loop is killed in the middle of running, meaning that the
>> supervisor token is left busy.
>>
>> On resume, SETSSBSY finds the token already busy, suffers #CP and triple
>> faults because the IDT isn't configured this early.
>>
>> Rework the AP bringup path to (re)create the supervisor token. This ensures
>> the primary stack is non-busy before use.
>>
>> Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
>> Link: https://github.com/QubesOS/qubes-issues/issues/7283
>> Reported-by: Thiner Logoer <logoerthiner1@163.com>
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Tested-by: Thiner Logoer <logoerthiner1@163.com>
>> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>> Slightly RFC. This does fix the crash encountered, but it occurs to me that
>> there's a race condition when S3 platform powerdown is incident with an
>> NMI/#MC, where more than just the primary shadow stack can end up busy on
>> resume.
>>
>> A larger fix would be to change how we allocate tokens, and always have each
>> CPU set up its own tokens. I didn't do this originally in the hopes of having
>> WRSSQ generally disabled, but that plan failed when encountering reality...
> While I think this wants fixing one way or another, I also think this
> shouldn't block the immediate fix here (which addresses an unconditional
> crash rather than a pretty unlikely one).
Fair point. I'll get this committed now, and work on the IST shstks later.
As a note for backporting, this depends on the adjustments made in c/s
311434bfc9d1 so isn't safe to backport in exactly this form. I'll sort
something out in due course.
~Andrew