Exit to userspace with -EFAULT and a valid MEMORY_FAULT exit if a vCPU
hits an unexpected pending S-EPT Violation instead of marking the VM dead.
While it's unlikely the VM can continue on, whether or not to terminate
the VM is not KVM's decision to make.
Set memory_fault.size to zero to communicate to userspace that reported
fault is "bad", and to effectively terminate the VM if userspace blindly
treats the exit as a conversion attempt (KVM_SET_MEMORY_ATTRIBUTES will
fail with -EINVAL if the size is zero).
Opportunistically delete the pr_warn(), which could be abused to spam the
kernel log, and is largely useless outside of interact debug as it doesn't
specify which VM encountered a failure.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/tdx.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3e0d4edee849..c2ef03f39c32 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
- pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
- gpa, vcpu->vcpu_id);
- kvm_vm_dead(vcpu->kvm);
- return -EIO;
+ kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true);
+ return -EFAULT;
}
/*
* Always treat SEPT violations as write faults. Ignore the
--
2.50.1.552.g942d659e1b-goog
On 7/30/2025 3:33 AM, Sean Christopherson wrote: > Exit to userspace with -EFAULT and a valid MEMORY_FAULT exit if a vCPU > hits an unexpected pending S-EPT Violation instead of marking the VM dead. > While it's unlikely the VM can continue on, whether or not to terminate > the VM is not KVM's decision to make. > > Set memory_fault.size to zero to communicate to userspace that reported > fault is "bad", and to effectively terminate the VM if userspace blindly > treats the exit as a conversion attempt (KVM_SET_MEMORY_ATTRIBUTES will > fail with -EINVAL if the size is zero). This sets a special contract on size zero. I had a patch internally, which introduce a new exit type: + /* KVM_EXIT_GUEST_ERROR */ + struct { + #define KVM_GUEST_ERROR_TDX_ACCESS_PENDING_PAGE 0 + __u32 error_type; + __u32 ndata; + __u64 data[16]; + } guest_error; how about it? > Opportunistically delete the pr_warn(), which could be abused to spam the > kernel log, and is largely useless outside of interact debug as it doesn't > specify which VM encountered a failure. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/vmx/tdx.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 3e0d4edee849..c2ef03f39c32 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) { > if (tdx_is_sept_violation_unexpected_pending(vcpu)) { > - pr_warn("Guest access before accepting 0x%llx on vCPU %d\n", > - gpa, vcpu->vcpu_id); > - kvm_vm_dead(vcpu->kvm); > - return -EIO; > + kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true); > + return -EFAULT; > } > /* > * Always treat SEPT violations as write faults. Ignore the
On Wed, Jul 30, 2025 at 10:07:36AM +0800, Xiaoyao Li wrote: > On 7/30/2025 3:33 AM, Sean Christopherson wrote: > > Exit to userspace with -EFAULT and a valid MEMORY_FAULT exit if a vCPU > > hits an unexpected pending S-EPT Violation instead of marking the VM dead. > > While it's unlikely the VM can continue on, whether or not to terminate > > the VM is not KVM's decision to make. > > > > Set memory_fault.size to zero to communicate to userspace that reported > > fault is "bad", and to effectively terminate the VM if userspace blindly > > treats the exit as a conversion attempt (KVM_SET_MEMORY_ATTRIBUTES will > > fail with -EINVAL if the size is zero). > > This sets a special contract on size zero. +1. Or would it be good to use pr_warn_once() to indicate that the VM termination reason is "Guest access before accepting" if a new exit type is not specified? This info is straightforward and helpful in cases when a TD is accidentally killed. > I had a patch internally, which introduce a new exit type: > > + /* KVM_EXIT_GUEST_ERROR */ > + struct { > + #define KVM_GUEST_ERROR_TDX_ACCESS_PENDING_PAGE 0 > + __u32 error_type; > + __u32 ndata; > + __u64 data[16]; > + } guest_error; > > how about it? > > > Opportunistically delete the pr_warn(), which could be abused to spam the > > kernel log, and is largely useless outside of interact debug as it doesn't > > specify which VM encountered a failure. > > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > --- > > arch/x86/kvm/vmx/tdx.c | 6 ++---- > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > index 3e0d4edee849..c2ef03f39c32 100644 > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) { > > if (tdx_is_sept_violation_unexpected_pending(vcpu)) { > > - pr_warn("Guest access before accepting 0x%llx on vCPU %d\n", > > - gpa, vcpu->vcpu_id); > > - kvm_vm_dead(vcpu->kvm); > > - return -EIO; > > + kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true); > > + return -EFAULT; > > } > > /* > > * Always treat SEPT violations as write faults. Ignore the > >
On Tue, 2025-07-29 at 12:33 -0700, Sean Christopherson wrote: > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 3e0d4edee849..c2ef03f39c32 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) { > if (tdx_is_sept_violation_unexpected_pending(vcpu)) { > - pr_warn("Guest access before accepting 0x%llx on vCPU %d\n", > - gpa, vcpu->vcpu_id); > - kvm_vm_dead(vcpu->kvm); > - return -EIO; > + kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true); > + return -EFAULT; > } > /* > * Always treat SEPT violations as write faults. Ignore the The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an attempt to set the mirror EPT entry when it is already present. And the unaccepted memory access will trigger an EPT violation for a mirror PTE that is already set. I think this is a better solution irrespective of the vm_dead changes. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> But hmm, tangentially related, but Yan do we have a similar problem with KVM_PRE_FAULT_MEMORY after we started setting pre_fault_allowed during TD finalization?
On Tue, Jul 29, 2025 at 10:27:34PM +0000, Edgecombe, Rick P wrote: > On Tue, 2025-07-29 at 12:33 -0700, Sean Christopherson wrote: > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > index 3e0d4edee849..c2ef03f39c32 100644 > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > > > if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) { > > if (tdx_is_sept_violation_unexpected_pending(vcpu)) { > > - pr_warn("Guest access before accepting 0x%llx on vCPU %d\n", > > - gpa, vcpu->vcpu_id); > > - kvm_vm_dead(vcpu->kvm); > > - return -EIO; > > + kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true); > > + return -EFAULT; > > } > > /* > > * Always treat SEPT violations as write faults. Ignore the > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an > attempt to set the mirror EPT entry when it is already present. And the > unaccepted memory access will trigger an EPT violation for a mirror PTE that is > already set. I think this is a better solution irrespective of the vm_dead > changes. > > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > > But hmm, tangentially related, but Yan do we have a similar problem with > KVM_PRE_FAULT_MEMORY after we started setting pre_fault_allowed during TD > finalization? Sean's commit 6385d01eec16 ("KVM: x86/mmu: Don't overwrite shadow-present MMU SPTEs when prefaulting") should have prevented repeated invocation of set_external_spte_present() with prefaulted entries.
On Wed, 2025-07-30 at 13:55 +0800, Yan Zhao wrote: > > But hmm, tangentially related, but Yan do we have a similar problem with > > KVM_PRE_FAULT_MEMORY after we started setting pre_fault_allowed during TD > > finalization? > Sean's commit 6385d01eec16 ("KVM: x86/mmu: Don't overwrite shadow-present MMU > SPTEs when prefaulting") should have prevented repeated invocation of > set_external_spte_present() with prefaulted entries. Ok, great.
On Tue, Jul 29, 2025, Rick P Edgecombe wrote: > On Tue, 2025-07-29 at 12:33 -0700, Sean Christopherson wrote: > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > index 3e0d4edee849..c2ef03f39c32 100644 > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -1937,10 +1937,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > > > if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) { > > if (tdx_is_sept_violation_unexpected_pending(vcpu)) { > > - pr_warn("Guest access before accepting 0x%llx on vCPU %d\n", > > - gpa, vcpu->vcpu_id); > > - kvm_vm_dead(vcpu->kvm); > > - return -EIO; > > + kvm_prepare_memory_fault_exit(vcpu, gpa, 0, true, false, true); > > + return -EFAULT; > > } > > /* > > * Always treat SEPT violations as write faults. Ignore the > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an > attempt to set the mirror EPT entry when it is already present. And the > unaccepted memory access will trigger an EPT violation for a mirror PTE that is > already set. I think this is a better solution irrespective of the vm_dead > changes. In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing prevents userspace from re-running the vCPU. Which KVM_BUG_ON() exactly gets hit?
On Tue, 2025-07-29 at 15:54 -0700, Sean Christopherson wrote: > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an > > attempt to set the mirror EPT entry when it is already present. And the > > unaccepted memory access will trigger an EPT violation for a mirror PTE that > > is > > already set. I think this is a better solution irrespective of the vm_dead > > changes. > > In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing > prevents userspace from re-running the vCPU. If userspace runs the vCPU again then an EPT violation gets triggered again, which again gets kicked out to userspace. The new check will prevent it from getting into the fault handler, right? > Which KVM_BUG_ON() exactly gets hit? Should be: static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep, gfn_t gfn, u64 old_spte, u64 new_spte, int level) { bool was_present = is_shadow_present_pte(old_spte); bool is_present = is_shadow_present_pte(new_spte); bool is_leaf = is_present && is_last_spte(new_spte, level); kvm_pfn_t new_pfn = spte_to_pfn(new_spte); int ret = 0; KVM_BUG_ON(was_present, kvm);
On Tue, Jul 29, 2025 at 10:58:02PM +0000, Edgecombe, Rick P wrote: > On Tue, 2025-07-29 at 15:54 -0700, Sean Christopherson wrote: > > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an > > > attempt to set the mirror EPT entry when it is already present. And the > > > unaccepted memory access will trigger an EPT violation for a mirror PTE that > > > is > > > already set. I think this is a better solution irrespective of the vm_dead > > > changes. > > > > In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing > > prevents userspace from re-running the vCPU. > > If userspace runs the vCPU again then an EPT violation gets triggered again, > which again gets kicked out to userspace. The new check will prevent it from > getting into the fault handler, right? > > > Which KVM_BUG_ON() exactly gets hit? > > Should be: > > static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t > sptep, > gfn_t gfn, u64 old_spte, > u64 new_spte, int level) > { > bool was_present = is_shadow_present_pte(old_spte); > bool is_present = is_shadow_present_pte(new_spte); > bool is_leaf = is_present && is_last_spte(new_spte, level); > kvm_pfn_t new_pfn = spte_to_pfn(new_spte); > int ret = 0; > > KVM_BUG_ON(was_present, kvm); Yes, this KVM_BUG_ON() could be triggered in an earlier code if tdx_is_sept_violation_unexpected_pending() wasn't checked to prevent repeated EPT faults. But the KVM_BUG_ON() is no longer reachable after the commit cc7ed3358e41 ("KVM: x86/mmu: Always set SPTE's dirty bit if it's created as writable"). With the current code, even without putting VM to dead, the worst case is the endless invocation of EPT fault handler, returning RET_PF_SPURIOUS.
On Tue, Jul 29, 2025, Rick P Edgecombe wrote: > On Tue, 2025-07-29 at 15:54 -0700, Sean Christopherson wrote: > > > The vm_dead was added because mirror EPT will KVM_BUG_ON() if there is an > > > attempt to set the mirror EPT entry when it is already present. And the > > > unaccepted memory access will trigger an EPT violation for a mirror PTE > > > that is already set. I think this is a better solution irrespective of > > > the vm_dead changes. > > > > In that case, this change will expose KVM to the KVM_BUG_ON(), because nothing > > prevents userspace from re-running the vCPU. > > If userspace runs the vCPU again then an EPT violation gets triggered again, > which again gets kicked out to userspace. The new check will prevent it from > getting into the fault handler, right? Yes? But I'm confused about why you mentioned vm_dead, and why you're calling this a "new check". This effectively does two things: drops kvm_vm_dead() and switches from EOI => EFAULT. _If_ setting vm_dead was necessary, then we have a problem. I assume by "The vm_dead was added" you really mean "forcing an exit to userspace", and that kvm_vm_dead()+EIO was a somewhat arbitrary way of forcing an exit? > > Which KVM_BUG_ON() exactly gets hit? > > Should be: > > static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t > sptep, > gfn_t gfn, u64 old_spte, > u64 new_spte, int level) > { > bool was_present = is_shadow_present_pte(old_spte); > bool is_present = is_shadow_present_pte(new_spte); > bool is_leaf = is_present && is_last_spte(new_spte, level); > kvm_pfn_t new_pfn = spte_to_pfn(new_spte); > int ret = 0; > > KVM_BUG_ON(was_present, kvm); Yeah, I don't see how that can be reach in this scenario.
On Tue, 2025-07-29 at 16:08 -0700, Sean Christopherson wrote: > > If userspace runs the vCPU again then an EPT violation gets triggered again, > > which again gets kicked out to userspace. The new check will prevent it from > > getting into the fault handler, right? > > Yes? But I'm confused about why you mentioned vm_dead, and why you're calling > this a "new check". This effectively does two things: drops kvm_vm_dead() and > switches from EOI => EFAULT. _If_ setting vm_dead was necessary, then we have > a > problem. > > I assume by "The vm_dead was added" you really mean "forcing an exit to > userspace", > and that kvm_vm_dead()+EIO was a somewhat arbitrary way of forcing an exit? Sorry, yes vm_dead prevents an EPT violation loop but not the KVM_BUG_ON(). The whole if clause prevents the KVM_BUG_ON(). Your patch prevents the ept violation loop in a better way.
© 2016 - 2025 Red Hat, Inc.