[v8] Introduce CET supervisor state support

[PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chao Gao 8 months, 3 weeks ago

Dear maintainers and reviewers,

I kindly request your consideration for merging this series. Most of
patches have received Reviewed-by/Acked-by tags.

Thanks Chang, Rick, Xin, Sean and Dave for their help with this series.

== Changelog ==
v7->v8:
 - refine the comment in __fpstate_reset() (Sean)
 - provide helpers to provide default feature masks for host and
   guest FPUs (Sean)
 - v7: https://lore.kernel.org/kvm/20250512085735.564475-1-chao.gao@intel.com/

== Background ==

CET defines two register states: CET user, which includes user-mode control
registers, and CET supervisor, which consists of shadow-stack pointers for
privilege levels 0-2.

Current kernel disables shadow stacks in kernel mode, making the CET
supervisor state unused and eliminating the need for context switching.

== Problem ==

To virtualize CET for guests, KVM must accurately emulate hardware
behavior. A key challenge arises because there is no CPUID flag to indicate
that shadow stack is supported only in user mode. Therefore, KVM cannot
assume guests will not enable shadow stacks in kernel mode and must
preserve the CET supervisor state of vCPUs.

== Solution ==

An initial proposal to manually save and restore CET supervisor states
using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and
its impact on KVM's ABI. Instead, leveraging the kernel's FPU
infrastructure for context switching was favored [1].

The main question then became whether to enable the CET supervisor state
globally for all processes or restrict it to vCPU processes. This decision
involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU
processes and approximately 100 lines of code complexity in the kernel [2].
The agreed approach is to first try this optimal solution [3], i.e.,
restricting the CET supervisor state to guest FPUs only and eliminating
unnecessary space waste.

Key changes in this series are:

1) Fix existing issue regarding enabling guest supervisor states support.
2) Add default features and size for guest FPUs.
3) Add infrastructure to support guest-only features.
4) Add CET supervisor state as the first guest-only feature.

With the series in place, guest FPUs have xstate_bv[12] == xcomp_bv[12] == 1
and CET supervisor state is saved/reloaded when xsaves/xrstors executes on
guest FPUs. non-guest FPUs have xstate_bv[12] == xcomp_bv[12] == 0, then
CET supervisor state is not saved/restored.

== Performance ==

We measured context-switching performance with the benchmark [4] in following
three cases.

case 1: the baseline. i.e., this series isn't applied
case 2: baseline + this series. CET-S space is allocated for guest fpu only.
case 3: baseline + allocate CET-S space for all tasks. Hardware init
        optimization avoids writing out CET-S space on each XSAVES.

The performance differences in the three cases are very small and fall within the
run-to-run variation.

[1]: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/
[2]: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/
[3]: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/
[4]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/context_switch1.c



Chao Gao (4):
  x86/fpu/xstate: Differentiate default features for host and guest FPUs
  x86/fpu: Initialize guest FPU permissions from guest defaults
  x86/fpu: Initialize guest fpstate and FPU pseudo container from guest
    defaults
  x86/fpu: Remove xfd argument from __fpstate_reset()

Yang Weijiang (2):
  x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
  x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only
    feature

 arch/x86/include/asm/fpu/types.h  | 49 ++++++++++++++++++++++++----
 arch/x86/include/asm/fpu/xstate.h |  9 ++++--
 arch/x86/kernel/fpu/core.c        | 53 +++++++++++++++++++++++--------
 arch/x86/kernel/fpu/init.c        |  1 +
 arch/x86/kernel/fpu/xstate.c      | 40 +++++++++++++++++++----
 5 files changed, 122 insertions(+), 30 deletions(-)


base-commit: 5d7e238ec229cadaeda63b5f96b4ee90bac489e4
-- 
2.47.1

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chao Gao 8 months, 1 week ago

On Thu, May 22, 2025 at 08:10:03AM -0700, Chao Gao wrote:
>Dear maintainers and reviewers,
>
>I kindly request your consideration for merging this series. Most of
>patches have received Reviewed-by/Acked-by tags.

Looks like we now have AMD RB and the other issue (reported by Sean) was
pre-existing. x86 maintainers, please consider applying.

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Dave Hansen 8 months, 1 week ago

On 6/3/25 17:56, Chao Gao wrote:
> On Thu, May 22, 2025 at 08:10:03AM -0700, Chao Gao wrote:
>> Dear maintainers and reviewers,
>>
>> I kindly request your consideration for merging this series. Most of
>> patches have received Reviewed-by/Acked-by tags.
> Looks like we now have AMD RB and the other issue (reported by Sean) was
> pre-existing. x86 maintainers, please consider applying.

Hi Chao,

You might want to take a look at this:

https://docs.kernel.org/process/maintainer-tip.html#merge-window

Specifically:

> Please do not expect patches to be reviewed or merged by tip
> maintainers around or during the merge window. The trees are closed
> to all but urgent fixes during this time. They reopen once the merge
> window closes and a new -rc1 kernel has been released.

In other words, your mail to ask us to consider applying is going to get
ignored for at least a week. Best case, it gets put on one of our lists
to go look at later.

My suggestion to you and all other submitters of non-critical fixes is
to spend the time between now and -rc1 reviewing *others* code and
making sure yours is 100% ready once -rc1 is released. The week after
the -rc1 release is a great time to send these emails, not now.

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chao Gao 7 months, 3 weeks ago

On Wed, Jun 04, 2025 at 11:45:53AM -0700, Dave Hansen wrote:
>My suggestion to you and all other submitters of non-critical fixes is
>to spend the time between now and -rc1 reviewing *others* code and
>making sure yours is 100% ready once -rc1 is released. The week after
>the -rc1 release is a great time to send these emails, not now.

Hi Dave,

Thanks for the suggestion. I just tested this series on the latest tip/master.
It applies cleanly and passes all my tests. Please consider merging it if it
looks good to you.

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Sean Christopherson 8 months, 3 weeks ago

On Thu, May 22, 2025, Chao Gao wrote:
> Chao Gao (4):
>   x86/fpu/xstate: Differentiate default features for host and guest FPUs
>   x86/fpu: Initialize guest FPU permissions from guest defaults
>   x86/fpu: Initialize guest fpstate and FPU pseudo container from guest
>     defaults
>   x86/fpu: Remove xfd argument from __fpstate_reset()
> 
> Yang Weijiang (2):
>   x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
>   x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only
>     feature

Acked-by: Sean Christopherson <seanjc@google.com>

Side topic, and *probably* unrelated to this series, I tripped the following
WARN when running it through the KVM tests (though I don't think it has anything
to do with KVM?).  The WARN is the version of xfd_validate_state() that's guarded
by CONFIG_X86_DEBUG_FPU=y.

   WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
   Modules linked in: kvm_intel kvm irqbypass vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
   CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan Tainted: G S                  6.15.0-smp--3542d5d75b5c-cet #678 NONE 
   Tainted: [S]=CPU_OUT_OF_SPEC
   Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240807.2-0 10/09/2024
   RIP: 0010:xfd_validate_state+0x65/0x70
   Code: 10 4c 3b 60 18 74 23 49 81 fe 80 c4 45 ab 74 15 4d 0b 7e 08 49 f7 d7 49 85 df 75 0e 5b 41 5c 41 5e 41 5f 5d c3 40 84 ed 75 f2 <0f> 0b eb ee 0f 1f 80 00 00 00 00 66 0f 1f 00 0f 1f 44 00 00 48 89
   RSP: 0018:ff7ada85584a3e08 EFLAGS: 00010246
   RAX: ff2c5d2908a53940 RBX: 00000000000e00ff RCX: ff2c5d2908a53940
   RDX: 0000000000000001 RSI: 00000000000e00ff RDI: ff2c5d2908a521c0
   RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
   R10: ffffffffaa56fa4d R11: 0000000000000000 R12: 0000000000040000
   R13: ff2c5d2908a521c0 R14: ffffffffab45c480 R15: 0000000000000000
   FS:  00007f21084d6700(0000) GS:ff2c5da752b41000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000000000000 CR3: 00000001ca832006 CR4: 0000000000f73ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
   PKRU: 00000000
   Call Trace:
    <TASK>
    fpu__clear_user_states+0x9c/0x100
    arch_do_signal_or_restart+0x142/0x210
    exit_to_user_mode_loop+0x55/0x100
    do_syscall_64+0x205/0x2c0
    entry_SYSCALL_64_after_hwframe+0x4b/0x53
   RIP: 0033:0x55ad185f2ee0
   Code: 8c fc 48 8d 0d 6e d5 8e fc 4c 8d 05 64 cb 78 fc bf 03 00 00 00 ba 25 03 00 00 49 89 c1 31 c0 e8 e6 2e 08 00 cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 49 89 d7 49 89
   RSP: 002b:00007f21084d3e38 EFLAGS: 00000246 ORIG_RAX: 00000000000001b9
   RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000055ad18136f73
   RDX: 00007f21084d3e40 RSI: 00007f21084d3f70 RDI: 000000000000001b
   RBP: 00007f21084d4f90 R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
   R13: 000011f03fbc2f00 R14: ffffffffffffffff R15: 0000000000000000
    </TASK>
   irq event stamp: 368
   hardirqs last  enabled at (367): [<ffffffffaaf5f1b8>] _raw_write_unlock_irq+0x28/0x40
   hardirqs last disabled at (368): [<ffffffffaaf5487d>] __schedule+0x1bd/0xea0
   softirqs last  enabled at (0): [<ffffffffaa2aa1ca>] copy_process+0x38a/0x1350
   softirqs last disabled at (0): [<0000000000000000>] 0x0
   ---[ end trace 0000000000000000 ]---

But I've hit the WARN once before, so whatever is going on is pre-existing.  I
haven't done any experiments to see if the WARN fires more frequently with this
series.  I mentioned it here purely out of convenience.

  ------------[ cut here ]------------
  WARNING: CPU: 77 PID: 14821 at arch/x86/kernel/fpu/xstate.c:1466 xfd_validate_state+0x4a/0x50
  Modules linked in: kvm_intel kvm irqbypass vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd sr_mod cdrom loop [last unloaded: kvm]
  CPU: 77 UID: 0 PID: 14821 Comm: futex-default-S Tainted: G S      W           6.15.0-smp--a2104d5ba341-sink #605 NONE 
  Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
  Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240807.2-0 10/09/2024
  RIP: 0010:xfd_validate_state+0x4a/0x50
  Code: 50 0a a9 4d 8b 80 90 17 00 00 49 3b 48 18 74 1a 48 81 ff 80 a4 65 a7 74 0d 48 0b 47 08 48 f7 d0 48 85 f0 75 05 c3 84 d2 75 fb <0f> 0b c3 0f 1f 00 66 0f 1f 00 0f 1f 44 00 00 48 89 f8 48 8b 7f 10
  RSP: 0018:ff1ba89ef124fe58 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffffffffa644871e RCX: 0000000000040000
  RDX: 0000000000000001 RSI: 00000000000600ff RDI: ffffffffa765a480
  RBP: 0000000000000000 R08: ff137abd4db65bc0 R09: 0000000000000000
  R10: ffffffffa6775f8d R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000000 R14: ff137abd4db65b80 R15: 0000000000000000
  FS:  00007fea8bce7700(0000) GS:ff137afc151b3000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 00000001a87c0004 CR4: 0000000000f73ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
  PKRU: 00000000
  Call Trace:
   <TASK>
   fpu__clear_user_states+0x92/0xf0
   arch_do_signal_or_restart+0x134/0x200
   syscall_exit_to_user_mode+0x8a/0x110
   do_syscall_64+0x8b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x563afb5588c0
  Code: f0 e9 df fc ff ff 48 8b 5d 88 4d 89 f0 e9 b5 fe ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 41 54 53 50 49 89 d7 e8 0a 6f f2 01
  RSP: 002b:00007fea8bce4e78 EFLAGS: 00000202 ORIG_RAX: 00000000000000e8
  RAX: 0000000000000000 RBX: 000011fcd50a6dd0 RCX: 0000563af84d6b30
  RDX: 00007fea8bce4e80 RSI: 00007fea8bce4fb0 RDI: 000000000000001e
  RBP: 00007fea8bce5c30 R08: 0000000000000000 R09: 00007fea8bce6ca0
  R10: 00000000000007d0 R11: 0000000000000202 R12: 0000000063239328
  R13: 00000000680b9c83 R14: 00000000000007d0 R15: 000011fcd5c46150
   </TASK>
  irq event stamp: 496018
  hardirqs last  enabled at (496017): [<ffffffffa7158965>] _raw_spin_unlock_irqrestore+0x35/0x50
  hardirqs last disabled at (496018): [<ffffffffa714e6fd>] __schedule+0x1bd/0xe90
  softirqs last  enabled at (495074): [<ffffffffa64c02ec>] __irq_exit_rcu+0x6c/0x130
  softirqs last disabled at (495065): [<ffffffffa64c02ec>] __irq_exit_rcu+0x6c/0x130
  ---[ end trace 0000000000000000 ]---

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Dave Hansen 8 months, 3 weeks ago

On 5/23/25 09:57, Sean Christopherson wrote:
> Side topic, and *probably* unrelated to this series, I tripped the following
> WARN when running it through the KVM tests (though I don't think it has anything
> to do with KVM?).  The WARN is the version of xfd_validate_state() that's guarded
> by CONFIG_X86_DEBUG_FPU=y.
> 
>    WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70

Huh, and the two processes getting hit by it:

   CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan ...
   CPU: 77  UID: 0 PID: 14821 Comm: futex-default-S ...

don't _look_ like KVM test processes. My guess would be it's some
mixture of KVM and a signal handler fighting with XFD state.

I take it this is a Sapphire Rapids system? Is there anything
interesting about the config other than CONFIG_X86_DEBUG_FPU?

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chao Gao 8 months, 2 weeks ago

On Fri, May 23, 2025 at 10:12:22AM -0700, Dave Hansen wrote:
>On 5/23/25 09:57, Sean Christopherson wrote:
>> Side topic, and *probably* unrelated to this series, I tripped the following
>> WARN when running it through the KVM tests (though I don't think it has anything
>> to do with KVM?).  The WARN is the version of xfd_validate_state() that's guarded
>> by CONFIG_X86_DEBUG_FPU=y.
>> 
>>    WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
>
>Huh, and the two processes getting hit by it:
>
>   CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan ...
>   CPU: 77  UID: 0 PID: 14821 Comm: futex-default-S ...
>
>don't _look_ like KVM test processes. My guess would be it's some
>mixture of KVM and a signal handler fighting with XFD state.

We are hitting the third case in the table below [*]:

 MSR | fpstate | cur->fpstate | valid
-------------------------------------
   0 |	   0   |	  x   |    1  // MSR matches @fpstate
   0 |	   1   |          0   |    1  // MSR matches cur->fpstate
   0 |	   1   |          1   |    0  <- *** MSR matches nothing!
   1 |	   0   |          0   |    0  <- *** MSR matches nothing!
   1 |	   0   |          1   |    1  // MSR matches cur->fpstate
   1 |	   1   |          x   |    1  // MSR matches @fpstate

*: https://lore.kernel.org/all/88cb75d3-01b9-38ea-e29f-b8fefb548573@intel.com/

The issue arises because the XFD MSR retains the value (i.e., 0, indicating
AMX enabled) from the previous process, while both the passed-in fpstate
(init_fpstate) and the current fpstate have AMX disabled.

To reproduce this issue, compile the kernel with CONFIG_PREEMPT=y, apply the
attached diff to the amx selftest and run:

# numactl -C 1 ./tools/testing/selftests/x86/amx_64

diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
index 40769c16de1b..4d533d1a530d 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -430,6 +430,10 @@ static inline void validate_tiledata_regs_changed(struct xsave_buffer *xbuf)
		fatal_error("TILEDATA registers did not change");
 }
 
+static void dummy_handler(int sig)
+{
+}
+
 /* tiledata inheritance test */
 
 static void test_fork(void)
@@ -444,6 +448,10 @@ static void test_fork(void)
		/* fork() succeeded.  Now in the parent. */
		int status;
 
+		req_xtiledata_perm();
+		load_rand_tiledata(stashed_xsave);
+		while(1);
+
		wait(&status);
		if (!WIFEXITED(status) || WEXITSTATUS(status))
			fatal_error("fork test child");
@@ -452,7 +460,9 @@ static void test_fork(void)
	/* fork() succeeded.  Now in the child. */
	printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tiledata\n");
 
-	load_rand_tiledata(stashed_xsave);
+	signal(SIGSEGV, dummy_handler);
+	while(1)
+		raise(SIGSEGV);
 
	grandchild = fork();
	if (grandchild < 0) {
@@ -500,9 +510,6 @@ int main(void)
 
	test_dynamic_state();
 
-	/* Request permission for the following tests */
-	req_xtiledata_perm();
-
	test_fork();
 
	/*


>
>I take it this is a Sapphire Rapids system? Is there anything
>interesting about the config other than CONFIG_X86_DEBUG_FPU?

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chang S. Bae 8 months, 1 week ago

On 5/27/2025 4:01 AM, Chao Gao wrote:
> 
> *: https://lore.kernel.org/all/88cb75d3-01b9-38ea-e29f-b8fefb548573@intel.com/
> 
> The issue arises because the XFD MSR retains the value (i.e., 0, indicating
> AMX enabled) from the previous process, while both the passed-in fpstate
> (init_fpstate) and the current fpstate have AMX disabled.
> 
> To reproduce this issue, compile the kernel with CONFIG_PREEMPT=y, apply the
> attached diff to the amx selftest and run:
> 
> # numactl -C 1 ./tools/testing/selftests/x86/amx_64
> 
> diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
> index 40769c16de1b..4d533d1a530d 100644
> --- a/tools/testing/selftests/x86/amx.c
> +++ b/tools/testing/selftests/x86/amx.c
> @@ -430,6 +430,10 @@ static inline void validate_tiledata_regs_changed(struct xsave_buffer *xbuf)
> 		fatal_error("TILEDATA registers did not change");
>   }
>   
> +static void dummy_handler(int sig)
> +{
> +}
> +
>   /* tiledata inheritance test */
>   
>   static void test_fork(void)
> @@ -444,6 +448,10 @@ static void test_fork(void)
> 		/* fork() succeeded.  Now in the parent. */
> 		int status;
>   
> +		req_xtiledata_perm();
> +		load_rand_tiledata(stashed_xsave);
> +		while(1);
> +
> 		wait(&status);
> 		if (!WIFEXITED(status) || WEXITSTATUS(status))
> 			fatal_error("fork test child");
> @@ -452,7 +460,9 @@ static void test_fork(void)
> 	/* fork() succeeded.  Now in the child. */
> 	printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tiledata\n");
>   
> -	load_rand_tiledata(stashed_xsave);
> +	signal(SIGSEGV, dummy_handler);
> +	while(1)
> +		raise(SIGSEGV);
>   
> 	grandchild = fork();
> 	if (grandchild < 0) {
> @@ -500,9 +510,6 @@ int main(void)
>   
> 	test_dynamic_state();
>   
> -	/* Request permission for the following tests */
> -	req_xtiledata_perm();
> -
> 	test_fork();
>   
> 	/*

The test case creates two processes -- the first uses AMX (task #1), and 
the other continuously sends signals without using AMX (task #2).

This leads to task #2 being preempted by task #1.

This behavior aligns with Sean’s report. From my investigation, the 
issue appears to have existed for quite some time and is not related to 
the changes in this series.

Here’s a summary of my findings:

== Preempt Case ==

To illustrate how the XFD MSR state becomes incorrect in this scenario:

  task #1 (fpstate->xfd=0)  task #2 (fpstate->xfd=0x80000)
  ========================  ==============================
                            handle_signal()
                            -> setup_rt_frame()
                               -> get_siframe()
                                  -> copy_fpstate_to_sigframe()
                                     -> fpregs_unlock()
                                  ...
   ...
   switch_fpu_return()
   -> fpregs_restore_userregs()
      -> restore_fpregs_from_fpstate()
         -> xfd_write_state()
            ^ IA32_XFD_MSR = 0
   ...
                                  ...
                               -> fpu__clear_user_states()
                                  -> fpregs_lock()
                                  -> restore_fpregs_from_init_fpstate()
                                     -> os_rstor()
                                        -> xfd_validate_state()
                                           ^ IA32_XFD_MSR != fpstate->xfd
                                  -> fpregs_mark_active()
                                  -> fpregs_unlock()

Since fpu__clear_user_states() marks the FPU state as valid in the end, 
an XFD MSR sync-up was clearly missing.

== Return-to-Userspace Path ==

Both tasks at that moment are on the return-to-userspace path, but at 
different points in IRQ state:

   * task #2 is inside handle_signal() and already re-enabled IRQs.
   * task #1 is after IRQ is disabled again when calling
     switch_fpu_return().

   local_irq_disable_exit_to_user()
   exit_to_user_mode_prepare()
   -> exit_to_user_mode_loop()
      -> local_irq_enable_exit_to_user()
         -> arch_do_signal_or_restart()
            -> handle_signal()
      -> local_irq_disable_exit_to_user()
   -> arch_exit_user_mode_prepare()
      -> arch_exit_work()
         -> switch_fpu_return()

This implies that fpregs_lock()/fpregs_unlock() is necessary inside 
handle_signal() when XSAVE instructions are invoked.

But, it should be okay for switch_fpu_return() to call 
fpregs_restore_userregs() without fpregs_lock().

== XFD Sanity Checker ==

The XFD sanity checker -- xfd_op_valid() -- correctly caught this issue 
in the test case. However, it may have a false negative when AMX usage 
was flipped between the two tasks.

Despite that, I don't think extending its coverage is worthwhile, as it 
would complicate the logic. The current logic and documentation seem 
sound.

== Fix Consideration ==

I think the fix is straightforward: resynchronize the IA32_XFD MSR in 
fpu__clear_user_states().

The existing xfd_update_state() function is self-contained and already 
performs feature checks and conditional MSR updates. Thus, it is not 
necessary to check the TIF_NEED_FPU_LOAD flag for this.

On the other hand, the sigreturn path already performs the XFD resync 
introduced by this commit:

   672365477ae8a ("x86/fpu: Update XFD state where required")

But I think that change was supposed to cover _both_ signal return and 
signal delivery paths. Sorry, it seems I had overlooked the latter
before.

Thanks,
Chang

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chao Gao 8 months, 1 week ago

On Mon, Jun 02, 2025 at 12:12:51PM -0700, Chang S. Bae wrote:
>== Preempt Case ==
>
>To illustrate how the XFD MSR state becomes incorrect in this scenario:
>
> task #1 (fpstate->xfd=0)  task #2 (fpstate->xfd=0x80000)
> ========================  ==============================
>                           handle_signal()
>                           -> setup_rt_frame()
>                              -> get_siframe()
>                                 -> copy_fpstate_to_sigframe()
>                                    -> fpregs_unlock()
>                                 ...
>  ...
>  switch_fpu_return()
>  -> fpregs_restore_userregs()
>     -> restore_fpregs_from_fpstate()
>        -> xfd_write_state()
>           ^ IA32_XFD_MSR = 0
>  ...
>                                 ...
>                              -> fpu__clear_user_states()
>                                 -> fpregs_lock()
>                                 -> restore_fpregs_from_init_fpstate()
>                                    -> os_rstor()
>                                       -> xfd_validate_state()
>                                          ^ IA32_XFD_MSR != fpstate->xfd
>                                 -> fpregs_mark_active()
>                                 -> fpregs_unlock()
>
>Since fpu__clear_user_states() marks the FPU state as valid in the end, an
>XFD MSR sync-up was clearly missing.

Thanks for this analysis. It makes sense.

>
>== Return-to-Userspace Path ==
>
>Both tasks at that moment are on the return-to-userspace path, but at
>different points in IRQ state:
>
>  * task #2 is inside handle_signal() and already re-enabled IRQs.
>  * task #1 is after IRQ is disabled again when calling
>    switch_fpu_return().
>
>  local_irq_disable_exit_to_user()
>  exit_to_user_mode_prepare()
>  -> exit_to_user_mode_loop()
>     -> local_irq_enable_exit_to_user()
>        -> arch_do_signal_or_restart()
>           -> handle_signal()
>     -> local_irq_disable_exit_to_user()
>  -> arch_exit_user_mode_prepare()
>     -> arch_exit_work()
>        -> switch_fpu_return()
>
>This implies that fpregs_lock()/fpregs_unlock() is necessary inside
>handle_signal() when XSAVE instructions are invoked.
>
>But, it should be okay for switch_fpu_return() to call
>fpregs_restore_userregs() without fpregs_lock().
>
>== XFD Sanity Checker ==
>
>The XFD sanity checker -- xfd_op_valid() -- correctly caught this issue in
>the test case. However, it may have a false negative when AMX usage was
>flipped between the two tasks.
>
>Despite that, I don't think extending its coverage is worthwhile, as it would
>complicate the logic. The current logic and documentation seem sound.
>
>== Fix Consideration ==
>
>I think the fix is straightforward: resynchronize the IA32_XFD MSR in
>fpu__clear_user_states().

This fix sounds good.

Btw, what do you think the impact of this issue might be?

Aside from the splat, task #2 could execute AMX instructions without
requesting permissions, but its AMX state would be discarded during the
next FPU switch, as RFBM[18] is cleared when executing XSAVES. And, in the
"flipped" scenario you mentioned, task #2 might receive an extra #NM, after
which its fpstate would be re-allocated (although the size won't increase
further).

So, for well-behaved tasks that never use AMX, there is no impact; tasks
that use AMX may receive extra #NM. There won't be any unexpected #PF,
memory corruption, or kernel panic.

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Chang S. Bae 8 months, 1 week ago

On 6/2/2025 11:22 PM, Chao Gao wrote:
> 
> Aside from the splat, task #2 could execute AMX instructions without
> requesting permissions, but its AMX state would be discarded during the
> next FPU switch, as RFBM[18] is cleared when executing XSAVES. And, in the

Right, AMX instructions can be executed when XFD is disarmed. But in 
this case, it's inside a signal handler. On sigreturn, XTILE_DATA will 
be reloaded with the init state, since fpstate::user_xfeatures[18] is zero.

> "flipped" scenario you mentioned, task #2 might receive an extra #NM, after
> which its fpstate would be re-allocated (although the size won't increase
> further).

Yes.

> So, for well-behaved tasks that never use AMX, there is no impact; tasks
> that use AMX may receive extra #NM. There won't be any unexpected #PF,
> memory corruption, or kernel panic.

A signal handler is expected to stay within the bounds of 
async-signal-safe functions, so using AMX in that context is highly 
unlikely in practice. While the issue has existed, its real-world impact 
appears quite minimal in my view.

Thanks,
Chang

Re: [PATCH v8 0/6] Introduce CET supervisor state support

Posted by Sean Christopherson 8 months, 3 weeks ago

On Fri, May 23, 2025, Dave Hansen wrote:
> On 5/23/25 09:57, Sean Christopherson wrote:
> > Side topic, and *probably* unrelated to this series, I tripped the following
> > WARN when running it through the KVM tests (though I don't think it has anything
> > to do with KVM?).  The WARN is the version of xfd_validate_state() that's guarded
> > by CONFIG_X86_DEBUG_FPU=y.
> > 
> >    WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
> 
> Huh, and the two processes getting hit by it:
> 
>    CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan ...
>    CPU: 77  UID: 0 PID: 14821 Comm: futex-default-S ...
> 
> don't _look_ like KVM test processes.

Yeah, that's why I haven't dug into it, I don't really know where to start, and
I don't even really know what triggered it.

> My guess would be it's some mixture of KVM and a signal handler fighting with
> XFD state.
> 
> I take it this is a Sapphire Rapids system?

Emerald Rapids

> Is there anything interesting about the config other than CONFIG_X86_DEBUG_FPU?

The only thing I can think of that's remotely interesting is CONFIG_PROVE_LOCKING=y.
Other than that, it's a pretty vanilla config.