[v2] x86, fpu/kvm: fix crash with AMX

[PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Paolo Bonzini 1 month, 1 week ago

From: Sean Christopherson <seanjc@google.com>

When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
response to a guest WRMSR, clear XFD-disabled features in the saved (or to
be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
features that are disabled via the guest's XFD.  Because the kernel
executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
will cause XRSTOR to #NM and panic the kernel.

E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:

  ------------[ cut here ]------------
  WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:exc_device_not_available+0x101/0x110
  Call Trace:
   <TASK>
   asm_exc_device_not_available+0x1a/0x20
  RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
   switch_fpu_return+0x4a/0xb0
   kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
   kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
   __x64_sys_ioctl+0x8f/0xd0
   do_syscall_64+0x62/0x940
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  ---[ end trace 0000000000000000 ]---

This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
call to fpu_update_guest_xfd().

and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:

  ------------[ cut here ]------------
  WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:exc_device_not_available+0x101/0x110
  Call Trace:
   <TASK>
   asm_exc_device_not_available+0x1a/0x20
  RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
   fpu_swap_kvm_fpstate+0x6b/0x120
   kvm_load_guest_fpu+0x30/0x80 [kvm]
   kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
   kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
   __x64_sys_ioctl+0x8f/0xd0
   do_syscall_64+0x62/0x940
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  ---[ end trace 0000000000000000 ]---

The new behavior is consistent with the AMX architecture.  Per Intel's SDM,
XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
(and non-compacted XSAVE saves the initial configuration of the state
component):

  If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
  the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
  instead, it operates as if XINUSE[i] = 0 (and the state component was
  in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
  header as 0; in addition, XSAVE saves the initial configuration of the
  state component (the other instructions do not save state component i).

Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
a constant XFD based on the set of enabled features when XSAVEing for
a struct fpu_guest.  However, having XSTATE_BV[i]=1 for XFD-disabled
features can only happen in the above interrupt case, or in similar
scenarios involving preemption on preemptible kernels, because
fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
outgoing FPU state with the current XFD; and that is (on all but the
first WRMSR to XFD) the guest XFD.

Therefore, XFD can only go out of sync with XSTATE_BV in the above
interrupt case, or in similar scenarios involving preemption on
preemptible kernels, and it we can consider it (de facto) part of KVM
ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.

Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
 to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c         |  9 +++++++++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index da233f20ae6f..166c380b0161 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
 #ifdef CONFIG_X86_64
 void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
 {
+	struct fpstate *fpstate = guest_fpu->fpstate;
+
 	fpregs_lock();
-	guest_fpu->fpstate->xfd = xfd;
-	if (guest_fpu->fpstate->in_use)
-		xfd_update_state(guest_fpu->fpstate);
+
+	/*
+	 * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
+	 * the save state to initialized.  Likewise, KVM_GET_XSAVE does the
+	 * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
+	 *
+	 * If the guest's FPU state is in hardware, just update XFD: the XSAVE
+	 * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
+	 *
+	 * If however the guest's FPU state is NOT resident in hardware, clear
+	 * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
+	 * attempt to load disabled components and generate #NM _in the host_.
+	 */
+	if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpstate->regs.xsave.header.xfeatures &= ~xfd;
+
+	fpstate->xfd = xfd;
+	if (fpstate->in_use)
+		xfd_update_state(fpstate);
+
 	fpregs_unlock();
 }
 EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
@@ -430,6 +449,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 	if (ustate->xsave.header.xfeatures & ~xcr0)
 		return -EINVAL;
 
+	/*
+	 * Disabled features must be in their initial state, otherwise XRSTOR
+	 * causes an exception.
+	 */
+	if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd))
+		return -EINVAL;
+
 	/*
 	 * Nullify @vpkru to preserve its current value if PKRU's bit isn't set
 	 * in the header.  KVM's odd ABI is to leave PKRU untouched in this
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff8812f3a129..c0416f53b5f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5807,9 +5807,18 @@ static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					struct kvm_xsave *guest_xsave)
 {
+	union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
 
+	/*
+	 * Do not reject non-initialized disabled features for backwards
+	 * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
+	 * Otherwise, XRSTOR would cause a #NM.
+	 */
+	xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd;
+
 	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
 					      guest_xsave->region,
 					      kvm_caps.supported_xcr0,
-- 
2.52.0

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Dave Hansen 3 weeks, 3 days ago

On 1/1/26 01:05, Paolo Bonzini wrote:
> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
> features that are disabled via the guest's XFD.  Because the kernel
> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
> will cause XRSTOR to #NM and panic the kernel.

It would be really nice to see the actual ordering of events here. What
order do the KVM_SET_XSAVE, XFD[$FOO]=1 and kernel_fpu_begin() have to
happen in to trigger this?

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Paolo Bonzini 3 weeks, 3 days ago

On 1/15/26 16:54, Dave Hansen wrote:
> On 1/1/26 01:05, Paolo Bonzini wrote:
>> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
>> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
>> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
>> features that are disabled via the guest's XFD.  Because the kernel
>> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
>> will cause XRSTOR to #NM and panic the kernel.
> 
> It would be really nice to see the actual ordering of events here. What
> order do the KVM_SET_XSAVE, XFD[$FOO]=1 and kernel_fpu_begin() have to
> happen in to trigger this?

The problematic case is described a couple paragraphs below: "This can 
happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1, and 
a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's 
call to fpu_update_guest_xfd()."

Or more in detail:

   Guest running with MSR_IA32_XFD = 0
     WRMSR(MSR_IA32_XFD)
     vmexit
   Host:
     enable IRQ
     interrupt handler
       kernel_fpu_begin() -> sets TIF_NEED_FPU_LOAD
         XSAVE -> stores XINUSE[18] = 1
         ...
       kernel_fpu_end()
     handle vmexit
       fpu_update_guest_xfd() -> XFD[18] = 1
     reenter guest
       fpu_swap_kvm_fpstate()
         XRSTOR -> XINUSE[18] = 1 && XFD[18] = 1 -> #NM and boom

With the patch, fpu_update_guest_xfd() sees TIF_NEED_FPU_LOAD set and 
clears the bit from xinuse.

Paolo

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Dave Hansen 3 weeks, 3 days ago

On 1/15/26 08:22, Paolo Bonzini wrote:
> 
>   Guest running with MSR_IA32_XFD = 0
>     WRMSR(MSR_IA32_XFD)
>     vmexit
>   Host:
>     enable IRQ
>     interrupt handler
>       kernel_fpu_begin() -> sets TIF_NEED_FPU_LOAD
>         XSAVE -> stores XINUSE[18] = 1
>         ...
>       kernel_fpu_end()
>     handle vmexit
>       fpu_update_guest_xfd() -> XFD[18] = 1
>     reenter guest
>       fpu_swap_kvm_fpstate()
>         XRSTOR -> XINUSE[18] = 1 && XFD[18] = 1 -> #NM and boom
> 
> With the patch, fpu_update_guest_xfd() sees TIF_NEED_FPU_LOAD set and
> clears the bit from xinuse.

Paolo, thanks for clarifying that!

Abbreviated, that's just:

	XFD[18]=0
	...
	# Interrupt (that does XSAVE)
	XFD[18]=1
	XRSTOR => #NM

Is there anything preventing the kernel_fpu_begin() interrupt from
happening a little later, say:

	XFD[18]=0
	...
	XFD[18]=1
	# Interrupt (that does XSAVE)
	XRSTOR (no #NM)

In that case, the XSAVE in kernel_fpu_begin() "operates as if XINUSE[i]
= 0" and would set XFEATURES[18]=0; it would save the component as being
in its init state. The later XRSTOR would obviously restore state 18 to
its init state.

Without involving SMIs, I think it lands feature 18 in its init state as
well. The state is _already_ being destroyed in the existing code
without anything exotic needing to happen.

That's a long-winded way of saying I think I agree with the patch. It
destroys the state a bit more aggressively but it doesn't do anything _new_.

What would folks think about making the SDM language stronger, or at
least explicitly adding the language that setting XFD[i]=1 can lead to
XINUSE[i] going from 1=>0. Kinda like the language that's already in
"XRSTOR and the Init and Modified Optimizations", but specific to XFD:

	If XFD[i] = 1 and XINUSE[i] = 1, state component i may be
	tracked as init; XINUSE[i] may be set to 0.

That would make it consistent with the KVM behavior. It might also give
the CPU folks some additional wiggle room for new behavior.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Chang S. Bae 3 weeks, 3 days ago

On 1/15/2026 10:19 AM, Dave Hansen wrote:
> 
> What would folks think about making the SDM language stronger, or at
> least explicitly adding the language that setting XFD[i]=1 can lead to
> XINUSE[i] going from 1=>0. Kinda like the language that's already in
> "XRSTOR and the Init and Modified Optimizations", but specific to XFD:
> 
> 	If XFD[i] = 1 and XINUSE[i] = 1, state component i may be
> 	tracked as init; XINUSE[i] may be set to 0.
> 
> That would make it consistent with the KVM behavior. It might also give
> the CPU folks some additional wiggle room for new behavior.

Yeah, I saw that you quoted this sentence in the XFD section in your 
other response:

	If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state
	component i, the instruction does not generate #NM when XCR0[i]
	= IA32_XFD[i] = 1; instead, it operates as if XINUSE[i] = 0 (and
	the state component was in its initial state)

Indeed, I do applaud the idea to clarify this behavior more explicitly 
right there.

Thanks,
Chang

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Paolo Bonzini 3 weeks, 3 days ago

On 1/15/26 19:19, Dave Hansen wrote:
> Is there anything preventing the kernel_fpu_begin() interrupt from
> happening a little later, say:
> 
> 	XFD[18]=0
> 	...
> 	XFD[18]=1
> 	# Interrupt (that does XSAVE)
> 	XRSTOR (no #NM)
> 	
> In that case, the XSAVE in kernel_fpu_begin() "operates as if XINUSE[i]
> = 0" and would set XFEATURES[18]=0; it would save the component as being
> in its init state. The later XRSTOR would obviously restore state 18 to
> its init state.

Yes, absolutely, and the fact that the race window is so small is why 
this issue stayed undetected for years.  In fact, consider that XFD 
becomes a pass-through MSR after the first write, at which point there's 
on race window at all---XFD[18] will be 1 if that's the guest value and 
the state will be destroyed.

I only mentioned SMIs as a way for this to happen on bare metal, i.e. 
without KVM involvement at all (though for dual-monitor treatment 
virtualization _is_ involved).

> That's a long-winded way of saying I think I agree with the patch. It
> destroys the state a bit more aggressively but it doesn't do anything _new_.

Thanks. :)

> What would folks think about making the SDM language stronger, or at
> least explicitly adding the language that setting XFD[i]=1 can lead to
> XINUSE[i] going from 1=>0. Kinda like the language that's already in
> "XRSTOR and the Init and Modified Optimizations", but specific to XFD:
> 
> 	If XFD[i] = 1 and XINUSE[i] = 1, state component i may be
> 	tracked as init; XINUSE[i] may be set to 0.
> 
> That would make it consistent with the KVM behavior. It might also give
> the CPU folks some additional wiggle room for new behavior.
Yes, absolutely.  I think any other hypervisor may want to do the same, 
to avoid save/restores of tile data to when guest XFD[18]=1 (and to 
avoid unnecessary clearing of XFD, just for the sake of storing tile 
data that is most likely unused).

Paolo

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Binbin Wu 1 month ago


On 1/1/2026 5:05 PM, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
> features that are disabled via the guest's XFD.  Because the kernel
> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
> will cause XRSTOR to #NM and panic the kernel.
> 
> E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:
> 
>   ------------[ cut here ]------------
>   WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
>   Modules linked in: kvm_intel kvm irqbypass
>   CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:exc_device_not_available+0x101/0x110
>   Call Trace:
>    <TASK>
>    asm_exc_device_not_available+0x1a/0x20
>   RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
>    switch_fpu_return+0x4a/0xb0
>    kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
>    kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
>    __x64_sys_ioctl+0x8f/0xd0
>    do_syscall_64+0x62/0x940
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
> 
> This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
> and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
> call to fpu_update_guest_xfd().
> 
> and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:
> 
>   ------------[ cut here ]------------
>   WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
>   Modules linked in: kvm_intel kvm irqbypass
>   CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:exc_device_not_available+0x101/0x110
>   Call Trace:
>    <TASK>
>    asm_exc_device_not_available+0x1a/0x20
>   RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
>    fpu_swap_kvm_fpstate+0x6b/0x120
>    kvm_load_guest_fpu+0x30/0x80 [kvm]
>    kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
>    kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
>    __x64_sys_ioctl+0x8f/0xd0
>    do_syscall_64+0x62/0x940
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
> 
> The new behavior is consistent with the AMX architecture.  Per Intel's SDM,
> XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
> (and non-compacted XSAVE saves the initial configuration of the state
> component):
> 
>   If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
>   the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
>   instead, it operates as if XINUSE[i] = 0 (and the state component was
>   in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
>   header as 0; in addition, XSAVE saves the initial configuration of the
>   state component (the other instructions do not save state component i).
> 
> Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
> a constant XFD based on the set of enabled features when XSAVEing for
> a struct fpu_guest.  However, having XSTATE_BV[i]=1 for XFD-disabled
> features can only happen in the above interrupt case, or in similar
> scenarios involving preemption on preemptible kernels, because
> fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
> outgoing FPU state with the current XFD; and that is (on all but the
> first WRMSR to XFD) the guest XFD.
> 
> Therefore, XFD can only go out of sync with XSTATE_BV in the above
> interrupt case, or in similar scenarios involving preemption on
> preemptible kernels, and it we can consider it (de facto) part of KVM
> ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

One nit blew.

> 
> Reported-by: Paolo Bonzini <pbonzini@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
>  to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++---
>  arch/x86/kvm/x86.c         |  9 +++++++++
>  2 files changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index da233f20ae6f..166c380b0161 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
>  #ifdef CONFIG_X86_64
>  void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
>  {
> +	struct fpstate *fpstate = guest_fpu->fpstate;
> +
>  	fpregs_lock();
> -	guest_fpu->fpstate->xfd = xfd;
> -	if (guest_fpu->fpstate->in_use)
> -		xfd_update_state(guest_fpu->fpstate);
> +
> +	/*
> +	 * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> +	 * the save state to initialized.  Likewise, KVM_GET_XSAVE does the

Nit:
To me "initialized" has the implication that it's active.
I prefer the description "initial state" or "initial configuration" used in
SDM here.
I am not a native English speaker though, please ignore it if it's just my
feeling.


> +	 * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> +	 *
> +	 * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> +	 * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> +	 *
> +	 * If however the guest's FPU state is NOT resident in hardware, clear
> +	 * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> +	 * attempt to load disabled components and generate #NM _in the host_.
> +	 */
> +	if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> +		fpstate->regs.xsave.header.xfeatures &= ~xfd;
> +
> +	fpstate->xfd = xfd;
> +	if (fpstate->in_use)
> +		xfd_update_state(fpstate);
> +
>  	fpregs_unlock();
>  }
>  EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
> @@ -430,6 +449,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
>  	if (ustate->xsave.header.xfeatures & ~xcr0)
>  		return -EINVAL;
>  
> +	/*
> +	 * Disabled features must be in their initial state, otherwise XRSTOR
> +	 * causes an exception.
> +	 */
> +	if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd))
> +		return -EINVAL;
> +
>  	/*
>  	 * Nullify @vpkru to preserve its current value if PKRU's bit isn't set
>  	 * in the header.  KVM's odd ABI is to leave PKRU untouched in this
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff8812f3a129..c0416f53b5f5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5807,9 +5807,18 @@ static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
>  static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>  					struct kvm_xsave *guest_xsave)
>  {
> +	union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region;
> +
>  	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>  		return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
>  
> +	/*
> +	 * Do not reject non-initialized disabled features for backwards
> +	 * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
> +	 * Otherwise, XRSTOR would cause a #NM.
> +	 */
> +	xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd;
> +
>  	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
>  					      guest_xsave->region,
>  					      kvm_caps.supported_xcr0,

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Paolo Bonzini 1 month ago

On Thu, Jan 8, 2026 at 4:08 AM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> > +     /*
> > +      * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > +      * the save state to initialized.  Likewise, KVM_GET_XSAVE does the
>
> Nit:
> To me "initialized" has the implication that it's active.
> I prefer the description "initial state" or "initial configuration" used in
> SDM here.
> I am not a native English speaker though, please ignore it if it's just my
> feeling.

Sure, why not:

   KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert the
   save state to its initial configuration.  Likewise, KVM_GET_XSAVE does
   the same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.

> > +     /*
> > +      * Do not reject non-initialized disabled features for backwards
> > +      * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
> > +      * Otherwise, XRSTOR would cause a #NM.
> > +      */

Same here:

   For backwards compatibility, do not expect disabled features to be in
   their initial state.  XSTATE_BV[i] must still be cleared whenever
   XFD[i]=1, or XRSTOR would cause a #NM.

Thanks!

Paolo

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Chang S. Bae 1 month ago

On 1/1/2026 1:05 AM, Paolo Bonzini wrote:
> 
> Therefore, XFD can only go out of sync with XSTATE_BV in the above
> interrupt case, or in similar scenarios involving preemption on

This seems to restate the scenario already described above; I’m not sure
whether the repetition is intentional.

> preemptible kernels, and it we can consider it (de facto) part of KVM
                            ^^^^^
I assume you meant 'we' here though, you might want to slightly rephrase 
it, given the previous debate:

   https://lore.kernel.org/all/87iko54f42.ffs@tglx/

> ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.

On my side, testing on AMX systems, I was able to reproduce the issue 
described and confirm that this patch resolves it:

   Tested-by: Chang S. Bae <chang.seok.bae@intel.com>

The added guards on these paths also look reasonable to me with the 
established KVM ABI. So,

   Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>

Thanks,
Chang

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Paolo Bonzini 1 month ago

On Wed, Jan 7, 2026 at 1:29 AM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>
> On 1/1/2026 1:05 AM, Paolo Bonzini wrote:
> >
> > Therefore, XFD can only go out of sync with XSTATE_BV in the above
> > interrupt case, or in similar scenarios involving preemption on
>
> This seems to restate the scenario already described above; I’m not sure
> whether the repetition is intentional.
>
> > preemptible kernels, and it we can consider it (de facto) part of KVM
>                             ^^^^^
> I assume you meant 'we' here though, you might want to slightly rephrase
> it, given the previous debate:
>
>    https://lore.kernel.org/all/87iko54f42.ffs@tglx/

There are two possible "we"s:

1) the code - in the context of this patch this would be "we force
XSTATE_BV[i] to 0" or "we can be preempted", and I agree it's bad form

2) the community, or the maintainers - this is the case in the commit
message, and I think it's acceptable. While I (Paolo) cannot forcibly
come to your computer and clear XSTATE_BV[i], I certainly can decide
that KVM will do so. :)

> > ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
>
> On my side, testing on AMX systems, I was able to reproduce the issue
> described and confirm that this patch resolves it:
>
>    Tested-by: Chang S. Bae <chang.seok.bae@intel.com>
>    Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>

Thanks!

Paolo

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Jim Mattson 1 month ago

On Thu, Jan 1, 2026 at 1:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
> ...
> +       /*
> +        * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> +        * the save state to initialized.

This comment suggests that an entry should be added to
Documentation/virt/kvm/x86/errata.rst.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Sean Christopherson 1 month ago

On Mon, Jan 05, 2026, Jim Mattson wrote:
> On Thu, Jan 1, 2026 at 1:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > From: Sean Christopherson <seanjc@google.com>
> > ...
> > +       /*
> > +        * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > +        * the save state to initialized.
> 
> This comment suggests that an entry should be added to
> Documentation/virt/kvm/x86/errata.rst.

Hmm, I don't think it's necessary, the SDM (in a style more suited for the APM,
*sigh*), "recommends" that software not rely on state being maintained when disabled
via XFD.

  Before doing so, system software should first initialize AMX state (e.g., by
  executing TILERELEASE); maintaining AMX state in a non-initialized state may
  have negative power and performance implications and will prevent the execution
  of In-Field Scan tests. In addition, software should not rely on the state of
  the tile data after setting IA32_XFD[17] or IA32_XFD[18]; software should always
  reload or reinitialize the tile data after clearing IA32_XFD[17] and IA32_XFD[18].

  System software should not use XFD to implement a “lazy restore” approach to
  management of the TILEDATA state component. This approach will not operate correctly
  for a variety of reasons. One is that the LDTILECFG and TILERELEASE instructions
  initialize TILEDATA and do not cause an #NM exception. Another is that an execution
  of XSAVE, XSAVEC, XSAVEOPT, or XSAVES by a user thread will save TILEDATA as
  initialized instead of the data expected by the user thread.

I suppose that doesn't _quite_ say that the CPU is allowed to clobber state, but
it's darn close.

I'm definitely not opposed to officially documenting KVM's virtual CPU implementation,
but IMO calling it an erratum is a bit unfair.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Jim Mattson 1 month ago

On Mon, Jan 5, 2026 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jan 05, 2026, Jim Mattson wrote:
> > On Thu, Jan 1, 2026 at 1:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > From: Sean Christopherson <seanjc@google.com>
> > > ...
> > > +       /*
> > > +        * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > > +        * the save state to initialized.
> >
> > This comment suggests that an entry should be added to
> > Documentation/virt/kvm/x86/errata.rst.
>
> Hmm, I don't think it's necessary, the SDM (in a style more suited for the APM,
> *sigh*), "recommends" that software not rely on state being maintained when disabled
> via XFD.
>
>   Before doing so, system software should first initialize AMX state (e.g., by
>   executing TILERELEASE); maintaining AMX state in a non-initialized state may
>   have negative power and performance implications and will prevent the execution
>   of In-Field Scan tests. In addition, software should not rely on the state of
>   the tile data after setting IA32_XFD[17] or IA32_XFD[18]; software should always
>   reload or reinitialize the tile data after clearing IA32_XFD[17] and IA32_XFD[18].
>
>   System software should not use XFD to implement a “lazy restore” approach to
>   management of the TILEDATA state component. This approach will not operate correctly
>   for a variety of reasons. One is that the LDTILECFG and TILERELEASE instructions
>   initialize TILEDATA and do not cause an #NM exception. Another is that an execution
>   of XSAVE, XSAVEC, XSAVEOPT, or XSAVES by a user thread will save TILEDATA as
>   initialized instead of the data expected by the user thread.
>
> I suppose that doesn't _quite_ say that the CPU is allowed to clobber state, but
> it's darn close.
>
> I'm definitely not opposed to officially documenting KVM's virtual CPU implementation,
> but IMO calling it an erratum is a bit unfair.

Apologies. You're right. Though Intel is a bit coy, the only way to
interpret that section of the SDM is to conclude that the AMX state in
the CPU becomes undefined when XFD[18] is set.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Dave Hansen 3 weeks, 3 days ago

On 1/6/26 09:56, Jim Mattson wrote:
> Apologies. You're right. Though Intel is a bit coy, the only way to
> interpret that section of the SDM is to conclude that the AMX state in
> the CPU becomes undefined when XFD[18] is set.

I'll touch base with the folks that wrote that blurb. I'm a little
nervous to interpret that "software should not..." blurb as a full
architectural DANGER sign partly because it's in a "RECOMMENDATIONS FOR
SYSTEM SOFTWARE" section.

I'm _sure_ they discussed tying XFD[i] and XINUSE[i] together and there
was a good reason they did not.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Paolo Bonzini 3 weeks, 3 days ago

On 1/15/26 17:07, Dave Hansen wrote:
> On 1/6/26 09:56, Jim Mattson wrote:
>> Apologies. You're right. Though Intel is a bit coy, the only way to
>> interpret that section of the SDM is to conclude that the AMX state in
>> the CPU becomes undefined when XFD[18] is set.
> 
> I'll touch base with the folks that wrote that blurb. I'm a little
> nervous to interpret that "software should not..." blurb as a full
> architectural DANGER sign partly because it's in a "RECOMMENDATIONS FOR
> SYSTEM SOFTWARE" section.
> 
> I'm _sure_ they discussed tying XFD[i] and XINUSE[i] together and there
> was a good reason they did not.

Is there anything that prevents an SMM handler (or more likely, an SMI 
transfer monitor) to do an XSAVE/XRSTOR and destroy tile data?

Paolo

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Dave Hansen 3 weeks, 3 days ago

On 1/15/26 08:12, Paolo Bonzini wrote:
...
>> I'm _sure_ they discussed tying XFD[i] and XINUSE[i] together and there
>> was a good reason they did not.
> 
> Is there anything that prevents an SMM handler (or more likely, an SMI
> transfer monitor) to do an XSAVE/XRSTOR and destroy tile data?

I think you're saying: let's assume XFD[18]=1 and XINUSE[18]=1 and
there's an SMI. The SMI handler does:

	XSAVE(RFBM=-1, &buf)
	... run some gunk
	XRSTOR(RFBM=-1, &buf)

to try and save everything. But, that XSAVE is subject to this behavior
from the SDM:

	If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state
	component i, the instruction does not generate #NM when XCR0[i]
	= IA32_XFD[i] = 1; instead, it operates as if XINUSE[i] = 0 (and
	the state component was in its initial state)

So 'buf' will end up having XFEATURES[18]=0. The XRSTOR will see
XFEATURES[18]=0 and set feature 18 to its init state, effectively
zapping its contents.

I guess the only thing preventing that in practice is the lack of XSAVE
use in SMM handlers. But I see your point.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Yao Yuan 1 month, 1 week ago

On Thu, Jan 01, 2026 at 10:05:13AM +0100, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
> features that are disabled via the guest's XFD.  Because the kernel
> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
> will cause XRSTOR to #NM and panic the kernel.
>
> E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:
>
>   ------------[ cut here ]------------
>   WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
>   Modules linked in: kvm_intel kvm irqbypass
>   CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:exc_device_not_available+0x101/0x110
>   Call Trace:
>    <TASK>
>    asm_exc_device_not_available+0x1a/0x20
>   RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
>    switch_fpu_return+0x4a/0xb0
>    kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
>    kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
>    __x64_sys_ioctl+0x8f/0xd0
>    do_syscall_64+0x62/0x940
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
>
> This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
> and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
> call to fpu_update_guest_xfd().
>
> and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:
>
>   ------------[ cut here ]------------
>   WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
>   Modules linked in: kvm_intel kvm irqbypass
>   CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:exc_device_not_available+0x101/0x110
>   Call Trace:
>    <TASK>
>    asm_exc_device_not_available+0x1a/0x20
>   RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
>    fpu_swap_kvm_fpstate+0x6b/0x120
>    kvm_load_guest_fpu+0x30/0x80 [kvm]
>    kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
>    kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
>    __x64_sys_ioctl+0x8f/0xd0
>    do_syscall_64+0x62/0x940
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
>
> The new behavior is consistent with the AMX architecture.  Per Intel's SDM,
> XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
> (and non-compacted XSAVE saves the initial configuration of the state
> component):
>
>   If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
>   the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
>   instead, it operates as if XINUSE[i] = 0 (and the state component was
>   in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
>   header as 0; in addition, XSAVE saves the initial configuration of the
>   state component (the other instructions do not save state component i).
>
> Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
> a constant XFD based on the set of enabled features when XSAVEing for
> a struct fpu_guest.  However, having XSTATE_BV[i]=1 for XFD-disabled
> features can only happen in the above interrupt case, or in similar
> scenarios involving preemption on preemptible kernels, because
> fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
> outgoing FPU state with the current XFD; and that is (on all but the
> first WRMSR to XFD) the guest XFD.
>
> Therefore, XFD can only go out of sync with XSTATE_BV in the above
> interrupt case, or in similar scenarios involving preemption on
> preemptible kernels, and it we can consider it (de facto) part of KVM
> ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
>
> Reported-by: Paolo Bonzini <pbonzini@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
>  to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++---
>  arch/x86/kvm/x86.c         |  9 +++++++++
>  2 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index da233f20ae6f..166c380b0161 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
>  #ifdef CONFIG_X86_64
>  void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
>  {
> +	struct fpstate *fpstate = guest_fpu->fpstate;
> +
>  	fpregs_lock();
> -	guest_fpu->fpstate->xfd = xfd;
> -	if (guest_fpu->fpstate->in_use)
> -		xfd_update_state(guest_fpu->fpstate);
> +
> +	/*
> +	 * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> +	 * the save state to initialized.  Likewise, KVM_GET_XSAVE does the
> +	 * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> +	 *
> +	 * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> +	 * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> +	 *
> +	 * If however the guest's FPU state is NOT resident in hardware, clear
> +	 * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> +	 * attempt to load disabled components and generate #NM _in the host_.
> +	 */

Hi Sean and Paolo,

> +	if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> +		fpstate->regs.xsave.header.xfeatures &= ~xfd;
> +
> +	fpstate->xfd = xfd;
> +	if (fpstate->in_use)
> +		xfd_update_state(fpstate);

I see a *small* window that the Host IRQ can happen just
after above TIF_NEED_FPU_LOAD checking, which could set
TIF_NEED_FPU_LOAD but w/o clear the xfd from
fpstate->regs.xsave.header.xfeatures.

But there's WARN in in kernel_fpu_begin_mask():

	WARN_ON_FPU(!irq_fpu_usable());

irq_fpu_usable()
{
	...
	/*
	 * In hard interrupt context it's safe when soft interrupts
	 * are enabled, which means the interrupt did not hit in
	 * a fpregs_lock()'ed critical region.
	 */
	return !softirq_count();
}

Looks we are relying on this to catch the above *small* window
yet, we're in fpregs_lock() region yet.

Is this correct understanding ?

> +
>  	fpregs_unlock();
>  }
>  EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
> @@ -430,6 +449,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
>  	if (ustate->xsave.header.xfeatures & ~xcr0)
>  		return -EINVAL;
>
> +	/*
> +	 * Disabled features must be in their initial state, otherwise XRSTOR
> +	 * causes an exception.
> +	 */
> +	if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd))
> +		return -EINVAL;
> +
>  	/*
>  	 * Nullify @vpkru to preserve its current value if PKRU's bit isn't set
>  	 * in the header.  KVM's odd ABI is to leave PKRU untouched in this
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff8812f3a129..c0416f53b5f5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5807,9 +5807,18 @@ static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
>  static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>  					struct kvm_xsave *guest_xsave)
>  {
> +	union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region;
> +
>  	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>  		return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
>
> +	/*
> +	 * Do not reject non-initialized disabled features for backwards
> +	 * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
> +	 * Otherwise, XRSTOR would cause a #NM.
> +	 */
> +	xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd;
> +
>  	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
>  					      guest_xsave->region,
>  					      kvm_caps.supported_xcr0,
> --
> 2.52.0
>
>

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Sean Christopherson 1 month ago

On Sat, Jan 03, 2026, Yao Yuan wrote:
> On Thu, Jan 01, 2026 at 10:05:13AM +0100, Paolo Bonzini wrote:
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index da233f20ae6f..166c380b0161 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
> >  #ifdef CONFIG_X86_64
> >  void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> >  {
> > +	struct fpstate *fpstate = guest_fpu->fpstate;
> > +
> >  	fpregs_lock();
> > -	guest_fpu->fpstate->xfd = xfd;
> > -	if (guest_fpu->fpstate->in_use)
> > -		xfd_update_state(guest_fpu->fpstate);
> > +
> > +	/*
> > +	 * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > +	 * the save state to initialized.  Likewise, KVM_GET_XSAVE does the
> > +	 * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> > +	 *
> > +	 * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> > +	 * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> > +	 *
> > +	 * If however the guest's FPU state is NOT resident in hardware, clear
> > +	 * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> > +	 * attempt to load disabled components and generate #NM _in the host_.
> > +	 */
> 
> Hi Sean and Paolo,
> 
> > +	if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> > +		fpstate->regs.xsave.header.xfeatures &= ~xfd;
> > +
> > +	fpstate->xfd = xfd;
> > +	if (fpstate->in_use)
> > +		xfd_update_state(fpstate);
> 
> I see a *small* window that the Host IRQ can happen just after above
> TIF_NEED_FPU_LOAD checking, which could set TIF_NEED_FPU_LOAD

Only if the code using FPU from IRQ context is buggy.  More below.

> but w/o clear the xfd from fpstate->regs.xsave.header.xfeatures.
> 
> But there's WARN in in kernel_fpu_begin_mask():
> 
> 	WARN_ON_FPU(!irq_fpu_usable());
> 
> irq_fpu_usable()
> {
> 	...
> 	/*
> 	 * In hard interrupt context it's safe when soft interrupts
> 	 * are enabled, which means the interrupt did not hit in
> 	 * a fpregs_lock()'ed critical region.
> 	 */
> 	return !softirq_count();
> }
> 
> Looks we are relying on this to catch the above *small* window
> yet, we're in fpregs_lock() region yet.

Kernel use of FPU from (soft) IRQ context is required to check irq_fpu_usable()
(e.g. via may_use_simd()), i.e. calling fpregs_lock() protects against the kernel
using the FPU and thus setting TIF_NEED_FPU_LOAD.

The WARN in kernel_fpu_begin_mask() is purely a sanity check to help detect and
debug buggy users.

Re: [PATCH 1/4] x86/fpu: Clear XSTATE_BV[i] in save state whenever XFD[i]=1

Posted by Yao Yuan 1 month ago

On Mon, Jan 05, 2026 at 09:31:05AM +0800, Sean Christopherson wrote:
> On Sat, Jan 03, 2026, Yao Yuan wrote:
> > On Thu, Jan 01, 2026 at 10:05:13AM +0100, Paolo Bonzini wrote:
> > > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > > index da233f20ae6f..166c380b0161 100644
> > > --- a/arch/x86/kernel/fpu/core.c
> > > +++ b/arch/x86/kernel/fpu/core.c
> > > @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
> > >  #ifdef CONFIG_X86_64
> > >  void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> > >  {
> > > +	struct fpstate *fpstate = guest_fpu->fpstate;
> > > +
> > >  	fpregs_lock();
> > > -	guest_fpu->fpstate->xfd = xfd;
> > > -	if (guest_fpu->fpstate->in_use)
> > > -		xfd_update_state(guest_fpu->fpstate);
> > > +
> > > +	/*
> > > +	 * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > > +	 * the save state to initialized.  Likewise, KVM_GET_XSAVE does the
> > > +	 * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> > > +	 *
> > > +	 * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> > > +	 * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> > > +	 *
> > > +	 * If however the guest's FPU state is NOT resident in hardware, clear
> > > +	 * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> > > +	 * attempt to load disabled components and generate #NM _in the host_.
> > > +	 */
> >
> > Hi Sean and Paolo,
> >
> > > +	if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> > > +		fpstate->regs.xsave.header.xfeatures &= ~xfd;
> > > +
> > > +	fpstate->xfd = xfd;
> > > +	if (fpstate->in_use)
> > > +		xfd_update_state(fpstate);
> >
> > I see a *small* window that the Host IRQ can happen just after above
> > TIF_NEED_FPU_LOAD checking, which could set TIF_NEED_FPU_LOAD
>
> Only if the code using FPU from IRQ context is buggy.  More below.
>
> > but w/o clear the xfd from fpstate->regs.xsave.header.xfeatures.
> >
> > But there's WARN in in kernel_fpu_begin_mask():
> >
> > 	WARN_ON_FPU(!irq_fpu_usable());
> >
> > irq_fpu_usable()
> > {
> > 	...
> > 	/*
> > 	 * In hard interrupt context it's safe when soft interrupts
> > 	 * are enabled, which means the interrupt did not hit in
> > 	 * a fpregs_lock()'ed critical region.
> > 	 */
> > 	return !softirq_count();
> > }
> >
> > Looks we are relying on this to catch the above *small* window
> > yet, we're in fpregs_lock() region yet.
>
> Kernel use of FPU from (soft) IRQ context is required to check irq_fpu_usable()
> (e.g. via may_use_simd()), i.e. calling fpregs_lock() protects against the kernel
> using the FPU and thus setting TIF_NEED_FPU_LOAD.
>
> The WARN in kernel_fpu_begin_mask() is purely a sanity check to help detect and
> debug buggy users.

OK, I have same understanding w/ you, thanks.

Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>