From: Sean Christopherson <seanjc@google.com>
When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
response to a guest WRMSR, clear XFD-disabled features in the saved (or to
be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
features that are disabled via the guest's XFD. Because the kernel
executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
will cause XRSTOR to #NM and panic the kernel.
E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:
------------[ cut here ]------------
WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
Modules linked in: kvm_intel kvm irqbypass
CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:exc_device_not_available+0x101/0x110
Call Trace:
<TASK>
asm_exc_device_not_available+0x1a/0x20
RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
switch_fpu_return+0x4a/0xb0
kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
__x64_sys_ioctl+0x8f/0xd0
do_syscall_64+0x62/0x940
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
call to fpu_update_guest_xfd().
and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:
------------[ cut here ]------------
WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
Modules linked in: kvm_intel kvm irqbypass
CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:exc_device_not_available+0x101/0x110
Call Trace:
<TASK>
asm_exc_device_not_available+0x1a/0x20
RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
fpu_swap_kvm_fpstate+0x6b/0x120
kvm_load_guest_fpu+0x30/0x80 [kvm]
kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
__x64_sys_ioctl+0x8f/0xd0
do_syscall_64+0x62/0x940
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
The new behavior is consistent with the AMX architecture. Per Intel's SDM,
XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
(and non-compacted XSAVE saves the initial configuration of the state
component):
If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
instead, it operates as if XINUSE[i] = 0 (and the state component was
in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
header as 0; in addition, XSAVE saves the initial configuration of the
state component (the other instructions do not save state component i).
Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
a constant XFD based on the set of enabled features when XSAVEing for
a struct fpu_guest. However, having XSTATE_BV[i]=1 for XFD-disabled
features can only happen in the above interrupt case, or in similar
scenarios involving preemption on preemptible kernels, because
fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
outgoing FPU state with the current XFD; and that is (on all but the
first WRMSR to XFD) the guest XFD.
Therefore, XFD can only go out of sync with XSTATE_BV in the above
interrupt case, or in similar scenarios involving preemption on
preemptible kernels, and it we can consider it (de facto) part of KVM
ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++---
arch/x86/kvm/x86.c | 9 +++++++++
2 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index da233f20ae6f..166c380b0161 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
#ifdef CONFIG_X86_64
void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
{
+ struct fpstate *fpstate = guest_fpu->fpstate;
+
fpregs_lock();
- guest_fpu->fpstate->xfd = xfd;
- if (guest_fpu->fpstate->in_use)
- xfd_update_state(guest_fpu->fpstate);
+
+ /*
+ * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
+ * the save state to initialized. Likewise, KVM_GET_XSAVE does the
+ * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
+ *
+ * If the guest's FPU state is in hardware, just update XFD: the XSAVE
+ * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
+ *
+ * If however the guest's FPU state is NOT resident in hardware, clear
+ * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
+ * attempt to load disabled components and generate #NM _in the host_.
+ */
+ if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpstate->regs.xsave.header.xfeatures &= ~xfd;
+
+ fpstate->xfd = xfd;
+ if (fpstate->in_use)
+ xfd_update_state(fpstate);
+
fpregs_unlock();
}
EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
@@ -430,6 +449,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
if (ustate->xsave.header.xfeatures & ~xcr0)
return -EINVAL;
+ /*
+ * Disabled features must be in their initial state, otherwise XRSTOR
+ * causes an exception.
+ */
+ if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd))
+ return -EINVAL;
+
/*
* Nullify @vpkru to preserve its current value if PKRU's bit isn't set
* in the header. KVM's odd ABI is to leave PKRU untouched in this
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff8812f3a129..c0416f53b5f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5807,9 +5807,18 @@ static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
+ union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region;
+
if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
+ /*
+ * Do not reject non-initialized disabled features for backwards
+ * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
+ * Otherwise, XRSTOR would cause a #NM.
+ */
+ xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd;
+
return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
guest_xsave->region,
kvm_caps.supported_xcr0,
--
2.52.0
On 1/1/26 01:05, Paolo Bonzini wrote: > When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in > response to a guest WRMSR, clear XFD-disabled features in the saved (or to > be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for > features that are disabled via the guest's XFD. Because the kernel > executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1 > will cause XRSTOR to #NM and panic the kernel. It would be really nice to see the actual ordering of events here. What order do the KVM_SET_XSAVE, XFD[$FOO]=1 and kernel_fpu_begin() have to happen in to trigger this?
On 1/15/26 16:54, Dave Hansen wrote:
> On 1/1/26 01:05, Paolo Bonzini wrote:
>> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
>> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
>> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
>> features that are disabled via the guest's XFD. Because the kernel
>> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
>> will cause XRSTOR to #NM and panic the kernel.
>
> It would be really nice to see the actual ordering of events here. What
> order do the KVM_SET_XSAVE, XFD[$FOO]=1 and kernel_fpu_begin() have to
> happen in to trigger this?
The problematic case is described a couple paragraphs below: "This can
happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1, and
a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
call to fpu_update_guest_xfd()."
Or more in detail:
Guest running with MSR_IA32_XFD = 0
WRMSR(MSR_IA32_XFD)
vmexit
Host:
enable IRQ
interrupt handler
kernel_fpu_begin() -> sets TIF_NEED_FPU_LOAD
XSAVE -> stores XINUSE[18] = 1
...
kernel_fpu_end()
handle vmexit
fpu_update_guest_xfd() -> XFD[18] = 1
reenter guest
fpu_swap_kvm_fpstate()
XRSTOR -> XINUSE[18] = 1 && XFD[18] = 1 -> #NM and boom
With the patch, fpu_update_guest_xfd() sees TIF_NEED_FPU_LOAD set and
clears the bit from xinuse.
Paolo
On 1/15/26 08:22, Paolo Bonzini wrote: > > Guest running with MSR_IA32_XFD = 0 > WRMSR(MSR_IA32_XFD) > vmexit > Host: > enable IRQ > interrupt handler > kernel_fpu_begin() -> sets TIF_NEED_FPU_LOAD > XSAVE -> stores XINUSE[18] = 1 > ... > kernel_fpu_end() > handle vmexit > fpu_update_guest_xfd() -> XFD[18] = 1 > reenter guest > fpu_swap_kvm_fpstate() > XRSTOR -> XINUSE[18] = 1 && XFD[18] = 1 -> #NM and boom > > With the patch, fpu_update_guest_xfd() sees TIF_NEED_FPU_LOAD set and > clears the bit from xinuse. Paolo, thanks for clarifying that! Abbreviated, that's just: XFD[18]=0 ... # Interrupt (that does XSAVE) XFD[18]=1 XRSTOR => #NM Is there anything preventing the kernel_fpu_begin() interrupt from happening a little later, say: XFD[18]=0 ... XFD[18]=1 # Interrupt (that does XSAVE) XRSTOR (no #NM) In that case, the XSAVE in kernel_fpu_begin() "operates as if XINUSE[i] = 0" and would set XFEATURES[18]=0; it would save the component as being in its init state. The later XRSTOR would obviously restore state 18 to its init state. Without involving SMIs, I think it lands feature 18 in its init state as well. The state is _already_ being destroyed in the existing code without anything exotic needing to happen. That's a long-winded way of saying I think I agree with the patch. It destroys the state a bit more aggressively but it doesn't do anything _new_. What would folks think about making the SDM language stronger, or at least explicitly adding the language that setting XFD[i]=1 can lead to XINUSE[i] going from 1=>0. Kinda like the language that's already in "XRSTOR and the Init and Modified Optimizations", but specific to XFD: If XFD[i] = 1 and XINUSE[i] = 1, state component i may be tracked as init; XINUSE[i] may be set to 0. That would make it consistent with the KVM behavior. It might also give the CPU folks some additional wiggle room for new behavior.
On 1/15/2026 10:19 AM, Dave Hansen wrote: > > What would folks think about making the SDM language stronger, or at > least explicitly adding the language that setting XFD[i]=1 can lead to > XINUSE[i] going from 1=>0. Kinda like the language that's already in > "XRSTOR and the Init and Modified Optimizations", but specific to XFD: > > If XFD[i] = 1 and XINUSE[i] = 1, state component i may be > tracked as init; XINUSE[i] may be set to 0. > > That would make it consistent with the KVM behavior. It might also give > the CPU folks some additional wiggle room for new behavior. Yeah, I saw that you quoted this sentence in the XFD section in your other response: If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i, the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1; instead, it operates as if XINUSE[i] = 0 (and the state component was in its initial state) Indeed, I do applaud the idea to clarify this behavior more explicitly right there. Thanks, Chang
On 1/15/26 19:19, Dave Hansen wrote: > Is there anything preventing the kernel_fpu_begin() interrupt from > happening a little later, say: > > XFD[18]=0 > ... > XFD[18]=1 > # Interrupt (that does XSAVE) > XRSTOR (no #NM) > > In that case, the XSAVE in kernel_fpu_begin() "operates as if XINUSE[i] > = 0" and would set XFEATURES[18]=0; it would save the component as being > in its init state. The later XRSTOR would obviously restore state 18 to > its init state. Yes, absolutely, and the fact that the race window is so small is why this issue stayed undetected for years. In fact, consider that XFD becomes a pass-through MSR after the first write, at which point there's on race window at all---XFD[18] will be 1 if that's the guest value and the state will be destroyed. I only mentioned SMIs as a way for this to happen on bare metal, i.e. without KVM involvement at all (though for dual-monitor treatment virtualization _is_ involved). > That's a long-winded way of saying I think I agree with the patch. It > destroys the state a bit more aggressively but it doesn't do anything _new_. Thanks. :) > What would folks think about making the SDM language stronger, or at > least explicitly adding the language that setting XFD[i]=1 can lead to > XINUSE[i] going from 1=>0. Kinda like the language that's already in > "XRSTOR and the Init and Modified Optimizations", but specific to XFD: > > If XFD[i] = 1 and XINUSE[i] = 1, state component i may be > tracked as init; XINUSE[i] may be set to 0. > > That would make it consistent with the KVM behavior. It might also give > the CPU folks some additional wiggle room for new behavior. Yes, absolutely. I think any other hypervisor may want to do the same, to avoid save/restores of tile data to when guest XFD[18]=1 (and to avoid unnecessary clearing of XFD, just for the sake of storing tile data that is most likely unused). Paolo
On 1/1/2026 5:05 PM, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
> features that are disabled via the guest's XFD. Because the kernel
> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
> will cause XRSTOR to #NM and panic the kernel.
>
> E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:
>
> ------------[ cut here ]------------
> WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
> Modules linked in: kvm_intel kvm irqbypass
> CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:exc_device_not_available+0x101/0x110
> Call Trace:
> <TASK>
> asm_exc_device_not_available+0x1a/0x20
> RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
> switch_fpu_return+0x4a/0xb0
> kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
> kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
> __x64_sys_ioctl+0x8f/0xd0
> do_syscall_64+0x62/0x940
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
> and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
> call to fpu_update_guest_xfd().
>
> and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:
>
> ------------[ cut here ]------------
> WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
> Modules linked in: kvm_intel kvm irqbypass
> CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:exc_device_not_available+0x101/0x110
> Call Trace:
> <TASK>
> asm_exc_device_not_available+0x1a/0x20
> RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
> fpu_swap_kvm_fpstate+0x6b/0x120
> kvm_load_guest_fpu+0x30/0x80 [kvm]
> kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
> kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
> __x64_sys_ioctl+0x8f/0xd0
> do_syscall_64+0x62/0x940
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> The new behavior is consistent with the AMX architecture. Per Intel's SDM,
> XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
> (and non-compacted XSAVE saves the initial configuration of the state
> component):
>
> If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
> the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
> instead, it operates as if XINUSE[i] = 0 (and the state component was
> in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
> header as 0; in addition, XSAVE saves the initial configuration of the
> state component (the other instructions do not save state component i).
>
> Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
> a constant XFD based on the set of enabled features when XSAVEing for
> a struct fpu_guest. However, having XSTATE_BV[i]=1 for XFD-disabled
> features can only happen in the above interrupt case, or in similar
> scenarios involving preemption on preemptible kernels, because
> fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
> outgoing FPU state with the current XFD; and that is (on all but the
> first WRMSR to XFD) the guest XFD.
>
> Therefore, XFD can only go out of sync with XSTATE_BV in the above
> interrupt case, or in similar scenarios involving preemption on
> preemptible kernels, and it we can consider it (de facto) part of KVM
> ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
One nit blew.
>
> Reported-by: Paolo Bonzini <pbonzini@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
> to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++---
> arch/x86/kvm/x86.c | 9 +++++++++
> 2 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index da233f20ae6f..166c380b0161 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
> #ifdef CONFIG_X86_64
> void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> {
> + struct fpstate *fpstate = guest_fpu->fpstate;
> +
> fpregs_lock();
> - guest_fpu->fpstate->xfd = xfd;
> - if (guest_fpu->fpstate->in_use)
> - xfd_update_state(guest_fpu->fpstate);
> +
> + /*
> + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> + * the save state to initialized. Likewise, KVM_GET_XSAVE does the
Nit:
To me "initialized" has the implication that it's active.
I prefer the description "initial state" or "initial configuration" used in
SDM here.
I am not a native English speaker though, please ignore it if it's just my
feeling.
> + * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> + *
> + * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> + * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> + *
> + * If however the guest's FPU state is NOT resident in hardware, clear
> + * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> + * attempt to load disabled components and generate #NM _in the host_.
> + */
> + if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> + fpstate->regs.xsave.header.xfeatures &= ~xfd;
> +
> + fpstate->xfd = xfd;
> + if (fpstate->in_use)
> + xfd_update_state(fpstate);
> +
> fpregs_unlock();
> }
> EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
> @@ -430,6 +449,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
> if (ustate->xsave.header.xfeatures & ~xcr0)
> return -EINVAL;
>
> + /*
> + * Disabled features must be in their initial state, otherwise XRSTOR
> + * causes an exception.
> + */
> + if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd))
> + return -EINVAL;
> +
> /*
> * Nullify @vpkru to preserve its current value if PKRU's bit isn't set
> * in the header. KVM's odd ABI is to leave PKRU untouched in this
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff8812f3a129..c0416f53b5f5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5807,9 +5807,18 @@ static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> struct kvm_xsave *guest_xsave)
> {
> + union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region;
> +
> if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
>
> + /*
> + * Do not reject non-initialized disabled features for backwards
> + * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
> + * Otherwise, XRSTOR would cause a #NM.
> + */
> + xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd;
> +
> return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
> guest_xsave->region,
> kvm_caps.supported_xcr0,
On Thu, Jan 8, 2026 at 4:08 AM Binbin Wu <binbin.wu@linux.intel.com> wrote: > > + /* > > + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert > > + * the save state to initialized. Likewise, KVM_GET_XSAVE does the > > Nit: > To me "initialized" has the implication that it's active. > I prefer the description "initial state" or "initial configuration" used in > SDM here. > I am not a native English speaker though, please ignore it if it's just my > feeling. Sure, why not: KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert the save state to its initial configuration. Likewise, KVM_GET_XSAVE does the same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1. > > + /* > > + * Do not reject non-initialized disabled features for backwards > > + * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1. > > + * Otherwise, XRSTOR would cause a #NM. > > + */ Same here: For backwards compatibility, do not expect disabled features to be in their initial state. XSTATE_BV[i] must still be cleared whenever XFD[i]=1, or XRSTOR would cause a #NM. Thanks! Paolo
On 1/1/2026 1:05 AM, Paolo Bonzini wrote:
>
> Therefore, XFD can only go out of sync with XSTATE_BV in the above
> interrupt case, or in similar scenarios involving preemption on
This seems to restate the scenario already described above; I’m not sure
whether the repetition is intentional.
> preemptible kernels, and it we can consider it (de facto) part of KVM
^^^^^
I assume you meant 'we' here though, you might want to slightly rephrase
it, given the previous debate:
https://lore.kernel.org/all/87iko54f42.ffs@tglx/
> ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
On my side, testing on AMX systems, I was able to reproduce the issue
described and confirm that this patch resolves it:
Tested-by: Chang S. Bae <chang.seok.bae@intel.com>
The added guards on these paths also look reasonable to me with the
established KVM ABI. So,
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Thanks,
Chang
On Wed, Jan 7, 2026 at 1:29 AM Chang S. Bae <chang.seok.bae@intel.com> wrote: > > On 1/1/2026 1:05 AM, Paolo Bonzini wrote: > > > > Therefore, XFD can only go out of sync with XSTATE_BV in the above > > interrupt case, or in similar scenarios involving preemption on > > This seems to restate the scenario already described above; I’m not sure > whether the repetition is intentional. > > > preemptible kernels, and it we can consider it (de facto) part of KVM > ^^^^^ > I assume you meant 'we' here though, you might want to slightly rephrase > it, given the previous debate: > > https://lore.kernel.org/all/87iko54f42.ffs@tglx/ There are two possible "we"s: 1) the code - in the context of this patch this would be "we force XSTATE_BV[i] to 0" or "we can be preempted", and I agree it's bad form 2) the community, or the maintainers - this is the case in the commit message, and I think it's acceptable. While I (Paolo) cannot forcibly come to your computer and clear XSTATE_BV[i], I certainly can decide that KVM will do so. :) > > ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features. > > On my side, testing on AMX systems, I was able to reproduce the issue > described and confirm that this patch resolves it: > > Tested-by: Chang S. Bae <chang.seok.bae@intel.com> > Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Thanks! Paolo
On Thu, Jan 1, 2026 at 1:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > From: Sean Christopherson <seanjc@google.com> > ... > + /* > + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert > + * the save state to initialized. This comment suggests that an entry should be added to Documentation/virt/kvm/x86/errata.rst.
On Mon, Jan 05, 2026, Jim Mattson wrote: > On Thu, Jan 1, 2026 at 1:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > > > From: Sean Christopherson <seanjc@google.com> > > ... > > + /* > > + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert > > + * the save state to initialized. > > This comment suggests that an entry should be added to > Documentation/virt/kvm/x86/errata.rst. Hmm, I don't think it's necessary, the SDM (in a style more suited for the APM, *sigh*), "recommends" that software not rely on state being maintained when disabled via XFD. Before doing so, system software should first initialize AMX state (e.g., by executing TILERELEASE); maintaining AMX state in a non-initialized state may have negative power and performance implications and will prevent the execution of In-Field Scan tests. In addition, software should not rely on the state of the tile data after setting IA32_XFD[17] or IA32_XFD[18]; software should always reload or reinitialize the tile data after clearing IA32_XFD[17] and IA32_XFD[18]. System software should not use XFD to implement a “lazy restore” approach to management of the TILEDATA state component. This approach will not operate correctly for a variety of reasons. One is that the LDTILECFG and TILERELEASE instructions initialize TILEDATA and do not cause an #NM exception. Another is that an execution of XSAVE, XSAVEC, XSAVEOPT, or XSAVES by a user thread will save TILEDATA as initialized instead of the data expected by the user thread. I suppose that doesn't _quite_ say that the CPU is allowed to clobber state, but it's darn close. I'm definitely not opposed to officially documenting KVM's virtual CPU implementation, but IMO calling it an erratum is a bit unfair.
On Mon, Jan 5, 2026 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Jan 05, 2026, Jim Mattson wrote: > > On Thu, Jan 1, 2026 at 1:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > > > > > From: Sean Christopherson <seanjc@google.com> > > > ... > > > + /* > > > + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert > > > + * the save state to initialized. > > > > This comment suggests that an entry should be added to > > Documentation/virt/kvm/x86/errata.rst. > > Hmm, I don't think it's necessary, the SDM (in a style more suited for the APM, > *sigh*), "recommends" that software not rely on state being maintained when disabled > via XFD. > > Before doing so, system software should first initialize AMX state (e.g., by > executing TILERELEASE); maintaining AMX state in a non-initialized state may > have negative power and performance implications and will prevent the execution > of In-Field Scan tests. In addition, software should not rely on the state of > the tile data after setting IA32_XFD[17] or IA32_XFD[18]; software should always > reload or reinitialize the tile data after clearing IA32_XFD[17] and IA32_XFD[18]. > > System software should not use XFD to implement a “lazy restore” approach to > management of the TILEDATA state component. This approach will not operate correctly > for a variety of reasons. One is that the LDTILECFG and TILERELEASE instructions > initialize TILEDATA and do not cause an #NM exception. Another is that an execution > of XSAVE, XSAVEC, XSAVEOPT, or XSAVES by a user thread will save TILEDATA as > initialized instead of the data expected by the user thread. > > I suppose that doesn't _quite_ say that the CPU is allowed to clobber state, but > it's darn close. > > I'm definitely not opposed to officially documenting KVM's virtual CPU implementation, > but IMO calling it an erratum is a bit unfair. Apologies. You're right. Though Intel is a bit coy, the only way to interpret that section of the SDM is to conclude that the AMX state in the CPU becomes undefined when XFD[18] is set.
On 1/6/26 09:56, Jim Mattson wrote: > Apologies. You're right. Though Intel is a bit coy, the only way to > interpret that section of the SDM is to conclude that the AMX state in > the CPU becomes undefined when XFD[18] is set. I'll touch base with the folks that wrote that blurb. I'm a little nervous to interpret that "software should not..." blurb as a full architectural DANGER sign partly because it's in a "RECOMMENDATIONS FOR SYSTEM SOFTWARE" section. I'm _sure_ they discussed tying XFD[i] and XINUSE[i] together and there was a good reason they did not.
On 1/15/26 17:07, Dave Hansen wrote: > On 1/6/26 09:56, Jim Mattson wrote: >> Apologies. You're right. Though Intel is a bit coy, the only way to >> interpret that section of the SDM is to conclude that the AMX state in >> the CPU becomes undefined when XFD[18] is set. > > I'll touch base with the folks that wrote that blurb. I'm a little > nervous to interpret that "software should not..." blurb as a full > architectural DANGER sign partly because it's in a "RECOMMENDATIONS FOR > SYSTEM SOFTWARE" section. > > I'm _sure_ they discussed tying XFD[i] and XINUSE[i] together and there > was a good reason they did not. Is there anything that prevents an SMM handler (or more likely, an SMI transfer monitor) to do an XSAVE/XRSTOR and destroy tile data? Paolo
On 1/15/26 08:12, Paolo Bonzini wrote: ... >> I'm _sure_ they discussed tying XFD[i] and XINUSE[i] together and there >> was a good reason they did not. > > Is there anything that prevents an SMM handler (or more likely, an SMI > transfer monitor) to do an XSAVE/XRSTOR and destroy tile data? I think you're saying: let's assume XFD[18]=1 and XINUSE[18]=1 and there's an SMI. The SMI handler does: XSAVE(RFBM=-1, &buf) ... run some gunk XRSTOR(RFBM=-1, &buf) to try and save everything. But, that XSAVE is subject to this behavior from the SDM: If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i, the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1; instead, it operates as if XINUSE[i] = 0 (and the state component was in its initial state) So 'buf' will end up having XFEATURES[18]=0. The XRSTOR will see XFEATURES[18]=0 and set feature 18 to its init state, effectively zapping its contents. I guess the only thing preventing that in practice is the lack of XSAVE use in SMM handlers. But I see your point.
On Thu, Jan 01, 2026 at 10:05:13AM +0100, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
> response to a guest WRMSR, clear XFD-disabled features in the saved (or to
> be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
> features that are disabled via the guest's XFD. Because the kernel
> executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
> will cause XRSTOR to #NM and panic the kernel.
>
> E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:
>
> ------------[ cut here ]------------
> WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
> Modules linked in: kvm_intel kvm irqbypass
> CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:exc_device_not_available+0x101/0x110
> Call Trace:
> <TASK>
> asm_exc_device_not_available+0x1a/0x20
> RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
> switch_fpu_return+0x4a/0xb0
> kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
> kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
> __x64_sys_ioctl+0x8f/0xd0
> do_syscall_64+0x62/0x940
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
> and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
> call to fpu_update_guest_xfd().
>
> and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:
>
> ------------[ cut here ]------------
> WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
> Modules linked in: kvm_intel kvm irqbypass
> CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:exc_device_not_available+0x101/0x110
> Call Trace:
> <TASK>
> asm_exc_device_not_available+0x1a/0x20
> RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
> fpu_swap_kvm_fpstate+0x6b/0x120
> kvm_load_guest_fpu+0x30/0x80 [kvm]
> kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
> kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
> __x64_sys_ioctl+0x8f/0xd0
> do_syscall_64+0x62/0x940
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> The new behavior is consistent with the AMX architecture. Per Intel's SDM,
> XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
> (and non-compacted XSAVE saves the initial configuration of the state
> component):
>
> If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
> the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
> instead, it operates as if XINUSE[i] = 0 (and the state component was
> in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
> header as 0; in addition, XSAVE saves the initial configuration of the
> state component (the other instructions do not save state component i).
>
> Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
> a constant XFD based on the set of enabled features when XSAVEing for
> a struct fpu_guest. However, having XSTATE_BV[i]=1 for XFD-disabled
> features can only happen in the above interrupt case, or in similar
> scenarios involving preemption on preemptible kernels, because
> fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
> outgoing FPU state with the current XFD; and that is (on all but the
> first WRMSR to XFD) the guest XFD.
>
> Therefore, XFD can only go out of sync with XSTATE_BV in the above
> interrupt case, or in similar scenarios involving preemption on
> preemptible kernels, and it we can consider it (de facto) part of KVM
> ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
>
> Reported-by: Paolo Bonzini <pbonzini@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
> to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++---
> arch/x86/kvm/x86.c | 9 +++++++++
> 2 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index da233f20ae6f..166c380b0161 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
> #ifdef CONFIG_X86_64
> void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> {
> + struct fpstate *fpstate = guest_fpu->fpstate;
> +
> fpregs_lock();
> - guest_fpu->fpstate->xfd = xfd;
> - if (guest_fpu->fpstate->in_use)
> - xfd_update_state(guest_fpu->fpstate);
> +
> + /*
> + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> + * the save state to initialized. Likewise, KVM_GET_XSAVE does the
> + * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> + *
> + * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> + * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> + *
> + * If however the guest's FPU state is NOT resident in hardware, clear
> + * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> + * attempt to load disabled components and generate #NM _in the host_.
> + */
Hi Sean and Paolo,
> + if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> + fpstate->regs.xsave.header.xfeatures &= ~xfd;
> +
> + fpstate->xfd = xfd;
> + if (fpstate->in_use)
> + xfd_update_state(fpstate);
I see a *small* window that the Host IRQ can happen just
after above TIF_NEED_FPU_LOAD checking, which could set
TIF_NEED_FPU_LOAD but w/o clear the xfd from
fpstate->regs.xsave.header.xfeatures.
But there's WARN in in kernel_fpu_begin_mask():
WARN_ON_FPU(!irq_fpu_usable());
irq_fpu_usable()
{
...
/*
* In hard interrupt context it's safe when soft interrupts
* are enabled, which means the interrupt did not hit in
* a fpregs_lock()'ed critical region.
*/
return !softirq_count();
}
Looks we are relying on this to catch the above *small* window
yet, we're in fpregs_lock() region yet.
Is this correct understanding ?
> +
> fpregs_unlock();
> }
> EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
> @@ -430,6 +449,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
> if (ustate->xsave.header.xfeatures & ~xcr0)
> return -EINVAL;
>
> + /*
> + * Disabled features must be in their initial state, otherwise XRSTOR
> + * causes an exception.
> + */
> + if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd))
> + return -EINVAL;
> +
> /*
> * Nullify @vpkru to preserve its current value if PKRU's bit isn't set
> * in the header. KVM's odd ABI is to leave PKRU untouched in this
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff8812f3a129..c0416f53b5f5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5807,9 +5807,18 @@ static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> struct kvm_xsave *guest_xsave)
> {
> + union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region;
> +
> if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
>
> + /*
> + * Do not reject non-initialized disabled features for backwards
> + * compatibility, but clear XSTATE_BV[i] whenever XFD[i]=1.
> + * Otherwise, XRSTOR would cause a #NM.
> + */
> + xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd;
> +
> return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
> guest_xsave->region,
> kvm_caps.supported_xcr0,
> --
> 2.52.0
>
>
On Sat, Jan 03, 2026, Yao Yuan wrote:
> On Thu, Jan 01, 2026 at 10:05:13AM +0100, Paolo Bonzini wrote:
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index da233f20ae6f..166c380b0161 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
> > #ifdef CONFIG_X86_64
> > void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> > {
> > + struct fpstate *fpstate = guest_fpu->fpstate;
> > +
> > fpregs_lock();
> > - guest_fpu->fpstate->xfd = xfd;
> > - if (guest_fpu->fpstate->in_use)
> > - xfd_update_state(guest_fpu->fpstate);
> > +
> > + /*
> > + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > + * the save state to initialized. Likewise, KVM_GET_XSAVE does the
> > + * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> > + *
> > + * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> > + * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> > + *
> > + * If however the guest's FPU state is NOT resident in hardware, clear
> > + * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> > + * attempt to load disabled components and generate #NM _in the host_.
> > + */
>
> Hi Sean and Paolo,
>
> > + if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> > + fpstate->regs.xsave.header.xfeatures &= ~xfd;
> > +
> > + fpstate->xfd = xfd;
> > + if (fpstate->in_use)
> > + xfd_update_state(fpstate);
>
> I see a *small* window that the Host IRQ can happen just after above
> TIF_NEED_FPU_LOAD checking, which could set TIF_NEED_FPU_LOAD
Only if the code using FPU from IRQ context is buggy. More below.
> but w/o clear the xfd from fpstate->regs.xsave.header.xfeatures.
>
> But there's WARN in in kernel_fpu_begin_mask():
>
> WARN_ON_FPU(!irq_fpu_usable());
>
> irq_fpu_usable()
> {
> ...
> /*
> * In hard interrupt context it's safe when soft interrupts
> * are enabled, which means the interrupt did not hit in
> * a fpregs_lock()'ed critical region.
> */
> return !softirq_count();
> }
>
> Looks we are relying on this to catch the above *small* window
> yet, we're in fpregs_lock() region yet.
Kernel use of FPU from (soft) IRQ context is required to check irq_fpu_usable()
(e.g. via may_use_simd()), i.e. calling fpregs_lock() protects against the kernel
using the FPU and thus setting TIF_NEED_FPU_LOAD.
The WARN in kernel_fpu_begin_mask() is purely a sanity check to help detect and
debug buggy users.
On Mon, Jan 05, 2026 at 09:31:05AM +0800, Sean Christopherson wrote:
> On Sat, Jan 03, 2026, Yao Yuan wrote:
> > On Thu, Jan 01, 2026 at 10:05:13AM +0100, Paolo Bonzini wrote:
> > > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > > index da233f20ae6f..166c380b0161 100644
> > > --- a/arch/x86/kernel/fpu/core.c
> > > +++ b/arch/x86/kernel/fpu/core.c
> > > @@ -319,10 +319,29 @@ EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
> > > #ifdef CONFIG_X86_64
> > > void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> > > {
> > > + struct fpstate *fpstate = guest_fpu->fpstate;
> > > +
> > > fpregs_lock();
> > > - guest_fpu->fpstate->xfd = xfd;
> > > - if (guest_fpu->fpstate->in_use)
> > > - xfd_update_state(guest_fpu->fpstate);
> > > +
> > > + /*
> > > + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert
> > > + * the save state to initialized. Likewise, KVM_GET_XSAVE does the
> > > + * same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1.
> > > + *
> > > + * If the guest's FPU state is in hardware, just update XFD: the XSAVE
> > > + * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1.
> > > + *
> > > + * If however the guest's FPU state is NOT resident in hardware, clear
> > > + * disabled components in XSTATE_BV now, or a subsequent XRSTOR will
> > > + * attempt to load disabled components and generate #NM _in the host_.
> > > + */
> >
> > Hi Sean and Paolo,
> >
> > > + if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD))
> > > + fpstate->regs.xsave.header.xfeatures &= ~xfd;
> > > +
> > > + fpstate->xfd = xfd;
> > > + if (fpstate->in_use)
> > > + xfd_update_state(fpstate);
> >
> > I see a *small* window that the Host IRQ can happen just after above
> > TIF_NEED_FPU_LOAD checking, which could set TIF_NEED_FPU_LOAD
>
> Only if the code using FPU from IRQ context is buggy. More below.
>
> > but w/o clear the xfd from fpstate->regs.xsave.header.xfeatures.
> >
> > But there's WARN in in kernel_fpu_begin_mask():
> >
> > WARN_ON_FPU(!irq_fpu_usable());
> >
> > irq_fpu_usable()
> > {
> > ...
> > /*
> > * In hard interrupt context it's safe when soft interrupts
> > * are enabled, which means the interrupt did not hit in
> > * a fpregs_lock()'ed critical region.
> > */
> > return !softirq_count();
> > }
> >
> > Looks we are relying on this to catch the above *small* window
> > yet, we're in fpregs_lock() region yet.
>
> Kernel use of FPU from (soft) IRQ context is required to check irq_fpu_usable()
> (e.g. via may_use_simd()), i.e. calling fpregs_lock() protects against the kernel
> using the FPU and thus setting TIF_NEED_FPU_LOAD.
>
> The WARN in kernel_fpu_begin_mask() is purely a sanity check to help detect and
> debug buggy users.
OK, I have same understanding w/ you, thanks.
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
© 2016 - 2026 Red Hat, Inc.