For hardware-protected VMs like SEV-SNP guests, certain conditions, such as
attempting to write to a page that is not in the state the guest expects it
to be in, can result in a nested/extended #PF which can only be satisfied by
the host performing an implicit page state change to transition the page into
the expected shared/private state. This is generally handled by generating a
KVM_EXIT_MEMORY_FAULT event that gets forwarded to userspace to handle via
KVM_SET_MEMORY_ATTRIBUTES.
However, the fast_page_fault() code might misconstrue this situation as
being the result of a write-protected access, and treat it as a spurious
case when it sees that writes are already allowed for the sPTE. The KVM MMU
then tries to resume the guest rather than taking any action to satisfy the
real source of the #PF, such as generating a KVM_EXIT_MEMORY_FAULT, leaving
the guest spinning on nested #PFs.
For now, just skip the fast path for hardware-protected VMs since they
don't currently utilize any of this access-tracking machinery anyway. In
the future, these considerations will need to be taken into account if
there's any need/desire to re-enable the fast path for
hardware-protected VMs.
Since software-protected VMs don't have a notion of shared vs. private state
that's separate from what KVM is tracking, the above KVM_EXIT_MEMORY_FAULT
condition wouldn't occur, so avoid the special handling for that case for now.
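For reference, a rough sketch of the userspace side assumed above, i.e. a VMM
handling the resulting KVM_EXIT_MEMORY_FAULT by updating the range's attributes
via KVM_SET_MEMORY_ATTRIBUTES and re-entering the guest (illustrative only, not
part of this patch; vm_fd/run are the VMM's VM fd and vcpu kvm_run mapping, and
error handling is omitted):

	struct kvm_memory_attributes attrs = {
		.address    = run->memory_fault.gpa,
		.size       = run->memory_fault.size,
		.attributes = (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
			      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	/* Flip the range to the state the guest faulted on, then KVM_RUN
	 * again so the guest retries the original access. */
	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);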
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 62ad38b2a8c9..cecd8360378f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3296,7 +3296,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
return RET_PF_CONTINUE;
}
-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
/*
* Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
@@ -3307,6 +3307,32 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
if (fault->rsvd)
return false;
+ /*
+ * For hardware-protected VMs, certain conditions like attempting to
+ * perform a write to a page which is not in the state that the guest
+ * expects it to be in can result in a nested/extended #PF. In this
+ * case, the below code might misconstrue this situation as being the
+ * result of a write-protected access, and treat it as a spurious case
+ * rather than taking any action to satisfy the real source of the #PF
+ * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
+ * guest spinning on a #PF indefinitely.
+ *
+ * For now, just skip the fast path for hardware-protected VMs since
+ * they don't currently utilize any of this machinery anyway. In the
+ * future, these considerations will need to be taken into account if
+ * there's any need/desire to re-enable the fast path for
+ * hardware-protected VMs.
+ *
+ * Since software-protected VMs don't have a notion of a shared vs.
+ * private that's separate from what KVM is tracking, the above
+ * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
+ * special handling for that case for now.
+ */
+ if (kvm_slot_can_be_private(fault->slot) &&
+ !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
+ return false;
+
/*
* #PF can be fast if:
*
@@ -3407,7 +3433,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 *sptep;
uint retry_count = 0;
- if (!page_fault_can_be_fast(fault))
+ if (!page_fault_can_be_fast(vcpu, fault))
return ret;
walk_shadow_page_lockless_begin(vcpu);
--
2.25.1
On Thu, May 09, 2024, Michael Roth wrote:
> ---
> arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 62ad38b2a8c9..cecd8360378f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3296,7 +3296,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> return RET_PF_CONTINUE;
> }
>
> -static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> /*
> * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
> @@ -3307,6 +3307,32 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
> if (fault->rsvd)
> return false;
>
> + /*
> + * For hardware-protected VMs, certain conditions like attempting to
> + * perform a write to a page which is not in the state that the guest
> + * expects it to be in can result in a nested/extended #PF. In this
> + * case, the below code might misconstrue this situation as being the
> + * result of a write-protected access, and treat it as a spurious case
> + * rather than taking any action to satisfy the real source of the #PF
> + * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
> + * guest spinning on a #PF indefinitely.
> + *
> + * For now, just skip the fast path for hardware-protected VMs since
> + * they don't currently utilize any of this machinery anyway. In the
> + * future, these considerations will need to be taken into account if
> + * there's any need/desire to re-enable the fast path for
> + * hardware-protected VMs.
> + *
> + * Since software-protected VMs don't have a notion of a shared vs.
> + * private that's separate from what KVM is tracking, the above
> + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> + * special handling for that case for now.
Very technically, it can occur if userspace _just_ modified the attributes. And
as I've said multiple times, at least for now, I want to avoid special casing
SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> + */
> + if (kvm_slot_can_be_private(fault->slot) &&
> + !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
Heh, !(x && y) kills me, I misread this like 4 times.
Anyways, I don't like the heuristic. It doesn't tie the restriction back to the
cause in any reasonable way. Can't this simply be?
if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
return false;
Which is much, much more self-explanatory.
On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > + * Since software-protected VMs don't have a notion of a shared vs.
> > + * private that's separate from what KVM is tracking, the above
> > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > + * special handling for that case for now.
>
> Very technically, it can occur if userspace _just_ modified the attributes. And
> as I've said multiple times, at least for now, I want to avoid special casing
> SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

Yep, it is not like they have to be optimized.

> > + */
> > + if (kvm_slot_can_be_private(fault->slot) &&
> > + !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
>
> Heh, !(x && y) kills me, I misread this like 4 times.
>
> Anyways, I don't like the heuristic. It doesn't tie the restriction back to the
> cause in any reasonable way. Can't this simply be?
>
> 	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> 		return false;

You beat me to it by seconds. And it can also be guarded by a check on
kvm->arch.has_private_mem to avoid the attributes lookup.

> Which is much, much more self-explanatory.

Both more self-explanatory and more correct.

Paolo
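Putting Sean's check and Paolo's kvm->arch.has_private_mem guard together, a
minimal sketch of how this might look in page_fault_can_be_fast() (illustrative
only; the version Mike actually tested is linked in the follow-up below):

	/*
	 * Bail out of the fast path if the fault's shared/private state
	 * disagrees with KVM's memory attributes, so the slow path can
	 * generate KVM_EXIT_MEMORY_FAULT instead of spuriously resuming
	 * the guest. The has_private_mem check avoids the attributes
	 * lookup for VMs without private memory.
	 */
	if (vcpu->kvm->arch.has_private_mem &&
	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
		return false;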
On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > + * Since software-protected VMs don't have a notion of a shared vs.
> > > + * private that's separate from what KVM is tracking, the above
> > > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > + * special handling for that case for now.
> >
> > Very technically, it can occur if userspace _just_ modified the attributes. And
> > as I've said multiple times, at least for now, I want to avoid special casing
> > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > is to allow testing flows that are impossible to excercise without SNP/TDX hardware.
>
> Yep, it is not like they have to be optimized.

Ok, I thought there were maybe some future plans to use sw-protected VMs
to get some added protections from userspace. But even then there'd
probably still be extra considerations for how to handle access tracking
so white-listing them probably isn't right anyway.

I was also partly tempted to take this route because it would cover this
TDX patch as well:

  https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

and avoid any weirdness about checking kvm_mem_is_private() without
checking mmu_invalidate_seq, but I think those cases all end up
resolving themselves eventually and added some comments around that.

> > > + */
> > > + if (kvm_slot_can_be_private(fault->slot) &&
> > > + !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > > + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
> >
> > Heh, !(x && y) kills me, I misread this like 4 times.
> >
> > Anyways, I don't like the heuristic. It doesn't tie the restriction back to the
> > cause in any reasonable way. Can't this simply be?
> >
> > 	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > 		return false;
>
> You beat me to it by seconds. And it can also be guarded by a check on
> kvm->arch.has_private_mem to avoid the attributes lookup.

I re-tested with things implemented this way and everything seems to
look good. It's not clear to me whether this would cover the cases the
above-mentioned TDX patch handles, but no biggie if that's still needed.

The new version of the patch is here:

  https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

And I've updated my branches to replace the old patch and also
incorporate Sean's suggestions for patch 22:

  https://github.com/mdroth/linux/commits/snp-host-v15c3-unsquashed

and have them here with things already squashed in/relocated:

  https://github.com/mdroth/linux/commits/snp-host-v15c3

Thanks for the feedback Sean, Paolo.

-Mike

> > Which is much, much more self-explanatory.
>
> Both more self-explanatory and more correct.
>
> Paolo
On Fri, May 10, 2024, Michael Roth wrote:
> On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > + * Since software-protected VMs don't have a notion of a shared vs.
> > > > + * private that's separate from what KVM is tracking, the above
> > > > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > + * special handling for that case for now.
> > >
> > > Very technically, it can occur if userspace _just_ modified the attributes. And
> > > as I've said multiple times, at least for now, I want to avoid special casing
> > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > is to allow testing flows that are impossible to excercise without SNP/TDX hardware.
> >
> > Yep, it is not like they have to be optimized.
>
> Ok, I thought there were maybe some future plans to use sw-protected VMs
> to get some added protections from userspace. But even then there'd
> probably still be extra considerations for how to handle access tracking
> so white-listing them probably isn't right anyway.
>
> I was also partly tempted to take this route because it would cover this
> TDX patch as well:
>
> https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
fixing, just in a less precise way. S-EPT entries only support RWX=0 and RWX=111b,
i.e. it should be impossible to have a write-fault to a present S-EPT entry.

And if TDX is running afoul of this code:

	if (!fault->present)
		return !kvm_ad_enabled();

then KVM should do the sane thing and require A/D support be enabled for TDX.

And if it's something else entirely, that changelog has some explaining to do.

> and avoid any weirdness about checking kvm_mem_is_private() without
> checking mmu_invalidate_seq, but I think those cases all end up
> resolving themselves eventually and added some comments around that.

Yep, checking state that is protected by mmu_invalidate_seq outside of mmu_lock
is definitely allowed, e.g. the entire fast page fault path operates outside of
mmu_lock and thus outside of mmu_invalidate_seq's purview.

It's a-ok because the SPTE updates are done with an atomic CMPXCHG, and so KVM
only needs to ensure its page tables aren't outright _freed_. If the zap
triggered by the attributes change "wins", then the fast #PF path will fail the
CMPXCHG and be an expensive NOP. If the fast #PF wins, the zap will pave over
the fast #PF fix, and the IPI+flush that is needed for all zaps, to ensure vCPUs
don't have stale references, does the rest.

And if there's an attributes race that causes the fast #PF to bail early, the
vCPU will see the correct state on the next page fault.
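To illustrate the CMPXCHG point above: the fast path only publishes its fix with
an atomic compare-and-exchange on the SPTE (as in fast_pf_fix_direct_spte()), so
a concurrent zap triggered by an attributes change simply makes that exchange
fail. A rough sketch of the commit step, not the exact upstream code:

	/*
	 * If something else (e.g. a zap due to a memory attributes update)
	 * modified the SPTE after it was read, the cmpxchg fails and the
	 * fast path gives up; the next fault goes down the slow path and
	 * sees the up-to-date state.
	 */
	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
		return false;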
On Fri, May 10, 2024 at 08:59:09AM -0700, Sean Christopherson <seanjc@google.com> wrote:
> On Fri, May 10, 2024, Michael Roth wrote:
> > On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > + * Since software-protected VMs don't have a notion of a shared vs.
> > > > > + * private that's separate from what KVM is tracking, the above
> > > > > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > > + * special handling for that case for now.
> > > >
> > > > Very technically, it can occur if userspace _just_ modified the attributes. And
> > > > as I've said multiple times, at least for now, I want to avoid special casing
> > > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > > is to allow testing flows that are impossible to excercise without SNP/TDX hardware.
> > >
> > > Yep, it is not like they have to be optimized.
> >
> > Ok, I thought there were maybe some future plans to use sw-protected VMs
> > to get some added protections from userspace. But even then there'd
> > probably still be extra considerations for how to handle access tracking
> > so white-listing them probably isn't right anyway.
> >
> > I was also partly tempted to take this route because it would cover this
> > TDX patch as well:
> >
> > https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/
>
> Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
> fixing, just in a less precise way. S-EPT entries only support RWX=0 and RWX=111b,
> i.e. it should be impossible to have a write-fault to a present S-EPT entry.
>
> And if TDX is running afoul of this code:
>
> 	if (!fault->present)
> 		return !kvm_ad_enabled();
>
> then KVM should do the sane thing and require A/D support be enabled for TDX.
>
> And if it's something else entirely, that changelog has some explaining to do.

Yes, it's for KVM_EXIT_MEMORY_FAULT case. Because Secure-EPT has non-present or
all RWX allowed, fast page fault always returns RET_PF_INVALID by
is_shadow_present_pte() check.

I lightly tested the patch at [1] and it works for TDX KVM.

[1] https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

Just in case for that patch,

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
--
Isaku Yamahata <isaku.yamahata@intel.com>
The intended logic when handling #NPFs with the RMP bit (bit 31) set is to
first check whether the #NPF requires a shared<->private transition
and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
get forwarded on to userspace before proceeding with any handling of
other potential RMP fault conditions like needing to PSMASH the RMP
entry/etc (which will be done later if the guest still re-faults after
the KVM_EXIT_MEMORY_FAULT is processed by userspace).
The determination of whether any userspace handling of
KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
of kvm_mmu_page_fault(). However, the current code misinterprets the
return code, expecting 0 to indicate a userspace exit rather than less
than 0 (-EFAULT). This leads to the following unexpected behavior:
- for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
conversions, warnings get printed from sev_handle_rmp_fault()
because it does not expect to be called for GPAs where
KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
generally do this, but it is allowed and should be handled
similarly to private->shared conversions rather than triggering any
sort of warnings
- if gmem support for 2MB folios is enabled (via currently out-of-tree
code), implicit shared<->private conversions will always result in
a PSMASH being attempted, even if it's not actually needed to
resolve the RMP fault. This doesn't cause any harm, but results in a
needless PSMASH and zapping of the sPTE
Resolve these issues by calling sev_handle_rmp_fault() only when
kvm_mmu_page_fault()'s return code is greater than or equal to 0,
indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
simplify the code slightly and fix up the associated comments for better
clarity.
Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
arch/x86/kvm/svm/svm.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 426ad49325d7..9431ce74c7d4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
svm->vmcb->control.insn_len);
/*
- * rc == 0 indicates a userspace exit is needed to handle page
- * transitions, so do that first before updating the RMP table.
+ * rc < 0 indicates a userspace exit may be needed to handle page
+ * attribute updates, so deal with that first before handling other
+ * potential RMP fault conditions.
*/
- if (error_code & PFERR_GUEST_RMP_MASK) {
- if (rc == 0)
- return rc;
+ if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
sev_handle_rmp_fault(vcpu, fault_address, error_code);
- }
return rc;
}
--
2.25.1
On Thu, May 09, 2024, Michael Roth wrote:
> The intended logic when handling #NPFs with the RMP bit set (31) is to
> first check to see if the #NPF requires a shared<->private transition
> and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> get forwarded on to userspace before proceeding with any handling of
> other potential RMP fault conditions like needing to PSMASH the RMP
> entry/etc (which will be done later if the guest still re-faults after
> the KVM_EXIT_MEMORY_FAULT is processed by userspace).
>
> The determination of whether any userspace handling of
> KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> of kvm_mmu_page_fault(). However, the current code misinterprets the
> return code, expecting 0 to indicate a userspace exit rather than less
> than 0 (-EFAULT). This leads to the following unexpected behavior:
>
> - for KVM_EXIT_MEMORY_FAULTs resulting for implicit shared->private
> conversions, warnings get printed from sev_handle_rmp_fault()
> because it does not expect to be called for GPAs where
> KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard linux guests don't
> generally do this, but it is allowed and should be handled
> similarly to private->shared conversions rather than triggering any
> sort of warnings
>
> - if gmem support for 2MB folios is enabled (via currently out-of-tree
> code), implicit shared<->private conversions will always result in
> a PSMASH being attempted, even if it's not actually needed to
> resolve the RMP fault. This doesn't cause any harm, but results in a
> needless PSMASH and zapping of the sPTE
>
> Resolve these issues by calling sev_handle_rmp_fault() only when
> kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> indicating a KVM_MEMORY_EXIT_FAULT/-EFAULT isn't needed. While here,
> simplify the code slightly and fix up the associated comments for better
> clarity.
>
> Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
> arch/x86/kvm/svm/svm.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 426ad49325d7..9431ce74c7d4 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> svm->vmcb->control.insn_len);
>
> /*
> - * rc == 0 indicates a userspace exit is needed to handle page
> - * transitions, so do that first before updating the RMP table.
> + * rc < 0 indicates a userspace exit may be needed to handle page
> + * attribute updates, so deal with that first before handling other
> + * potential RMP fault conditions.
> */
> - if (error_code & PFERR_GUEST_RMP_MASK) {
> - if (rc == 0)
> - return rc;
> + if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
This isn't correct either. A return of '0' also indicates "exit to userspace",
it just doesn't happen with SNP because '0' is returned only when KVM attempts
emulation, and that too gets short-circuited by svm_check_emulate_instruction().
And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
values overload is ubiquitous enough that it should be relatively self-explanatory.
Or if you prefer to keep a comment, drop the part that specifically calls out
attributes updates, because that incorrectly implies that's the _only_ reason
why KVM checks the return. But my vote is to drop the comment, because it
essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
makes the reader go "well, yeah".
On 5/10/24 15:58, Sean Christopherson wrote:
> On Thu, May 09, 2024, Michael Roth wrote:
>> The intended logic when handling #NPFs with the RMP bit set (31) is to
>> first check to see if the #NPF requires a shared<->private transition
>> and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
>> get forwarded on to userspace before proceeding with any handling of
>> other potential RMP fault conditions like needing to PSMASH the RMP
>> entry/etc (which will be done later if the guest still re-faults after
>> the KVM_EXIT_MEMORY_FAULT is processed by userspace).
>>
>> The determination of whether any userspace handling of
>> KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
>> of kvm_mmu_page_fault(). However, the current code misinterprets the
>> return code, expecting 0 to indicate a userspace exit rather than less
>> than 0 (-EFAULT). This leads to the following unexpected behavior:
>>
>> - for KVM_EXIT_MEMORY_FAULTs resulting for implicit shared->private
>> conversions, warnings get printed from sev_handle_rmp_fault()
>> because it does not expect to be called for GPAs where
>> KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard linux guests don't
>> generally do this, but it is allowed and should be handled
>> similarly to private->shared conversions rather than triggering any
>> sort of warnings
>>
>> - if gmem support for 2MB folios is enabled (via currently out-of-tree
>> code), implicit shared<->private conversions will always result in
>> a PSMASH being attempted, even if it's not actually needed to
>> resolve the RMP fault. This doesn't cause any harm, but results in a
>> needless PSMASH and zapping of the sPTE
>>
>> Resolve these issues by calling sev_handle_rmp_fault() only when
>> kvm_mmu_page_fault()'s return code is greater than or equal to 0,
>> indicating a KVM_MEMORY_EXIT_FAULT/-EFAULT isn't needed. While here,
>> simplify the code slightly and fix up the associated comments for better
>> clarity.
>>
>> Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>> ---
>> arch/x86/kvm/svm/svm.c | 10 ++++------
>> 1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 426ad49325d7..9431ce74c7d4 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>> svm->vmcb->control.insn_len);
>>
>> /*
>> - * rc == 0 indicates a userspace exit is needed to handle page
>> - * transitions, so do that first before updating the RMP table.
>> + * rc < 0 indicates a userspace exit may be needed to handle page
>> + * attribute updates, so deal with that first before handling other
>> + * potential RMP fault conditions.
>> */
>> - if (error_code & PFERR_GUEST_RMP_MASK) {
>> - if (rc == 0)
>> - return rc;
>> + if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
>
> This isn't correct either. A return of '0' also indiciates "exit to userspace",
> it just doesn't happen with SNP because '0' is returned only when KVM attempts
> emulation, and that too gets short-circuited by svm_check_emulate_instruction().
>
> And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> values overload is ubiquitous enough that it should be relatively self-explanatory.
>
> Or if you prefer to keep a comment, drop the part that specifically calls out
> attributes updates, because that incorrectly implies that's the _only_ reason
> why KVM checks the return. But my vote is to drop the comment, because it
> essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> makes the reader go "well, yeah".
So IIUC you're suggesting
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 426ad49325d7..c39eaeb21981 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2068,16 +2068,11 @@ static int npf_interception(struct kvm_vcpu *vcpu)
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
svm->vmcb->control.insn_len);
+ if (rc <= 0)
+ return rc;
- /*
- * rc == 0 indicates a userspace exit is needed to handle page
- * transitions, so do that first before updating the RMP table.
- */
- if (error_code & PFERR_GUEST_RMP_MASK) {
- if (rc == 0)
- return rc;
+ if (error_code & PFERR_GUEST_RMP_MASK)
sev_handle_rmp_fault(vcpu, fault_address, error_code);
- }
return rc;
}
?
So, we're... a bit tight for 6.10 to include SNP and that is an
understatement. My plan is to merge it for 6.11, but do so
immediately after the merge window ends. In other words, it
is a delay in terms of release but not in terms of time. I
don't want QEMU and kvm-unit-tests work to be delayed any
further, in particular.
Once we sort out the loose ends of patches 21-23, you could send
it as a pull request.
Paolo
On Fri, May 10, 2024 at 06:01:52PM +0200, Paolo Bonzini wrote:
> On 5/10/24 15:58, Sean Christopherson wrote:
> > On Thu, May 09, 2024, Michael Roth wrote:
> > > The intended logic when handling #NPFs with the RMP bit set (31) is to
> > > first check to see if the #NPF requires a shared<->private transition
> > > and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> > > get forwarded on to userspace before proceeding with any handling of
> > > other potential RMP fault conditions like needing to PSMASH the RMP
> > > entry/etc (which will be done later if the guest still re-faults after
> > > the KVM_EXIT_MEMORY_FAULT is processed by userspace).
> > >
> > > The determination of whether any userspace handling of
> > > KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> > > of kvm_mmu_page_fault(). However, the current code misinterprets the
> > > return code, expecting 0 to indicate a userspace exit rather than less
> > > than 0 (-EFAULT). This leads to the following unexpected behavior:
> > >
> > > - for KVM_EXIT_MEMORY_FAULTs resulting for implicit shared->private
> > > conversions, warnings get printed from sev_handle_rmp_fault()
> > > because it does not expect to be called for GPAs where
> > > KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard linux guests don't
> > > generally do this, but it is allowed and should be handled
> > > similarly to private->shared conversions rather than triggering any
> > > sort of warnings
> > >
> > > - if gmem support for 2MB folios is enabled (via currently out-of-tree
> > > code), implicit shared<->private conversions will always result in
> > > a PSMASH being attempted, even if it's not actually needed to
> > > resolve the RMP fault. This doesn't cause any harm, but results in a
> > > needless PSMASH and zapping of the sPTE
> > >
> > > Resolve these issues by calling sev_handle_rmp_fault() only when
> > > kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> > > indicating a KVM_MEMORY_EXIT_FAULT/-EFAULT isn't needed. While here,
> > > simplify the code slightly and fix up the associated comments for better
> > > clarity.
> > >
> > > Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
> > >
> > > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > > ---
> > > arch/x86/kvm/svm/svm.c | 10 ++++------
> > > 1 file changed, 4 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 426ad49325d7..9431ce74c7d4 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> > > svm->vmcb->control.insn_len);
> > > /*
> > > - * rc == 0 indicates a userspace exit is needed to handle page
> > > - * transitions, so do that first before updating the RMP table.
> > > + * rc < 0 indicates a userspace exit may be needed to handle page
> > > + * attribute updates, so deal with that first before handling other
> > > + * potential RMP fault conditions.
> > > */
> > > - if (error_code & PFERR_GUEST_RMP_MASK) {
> > > - if (rc == 0)
> > > - return rc;
> > > + if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
> >
> > This isn't correct either. A return of '0' also indiciates "exit to userspace",
> > it just doesn't happen with SNP because '0' is returned only when KVM attempts
> > emulation, and that too gets short-circuited by svm_check_emulate_instruction().
> >
> > And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> > values overload is ubiquitous enough that it should be relatively self-explanatory.
> >
> > Or if you prefer to keep a comment, drop the part that specifically calls out
> > attributes updates, because that incorrectly implies that's the _only_ reason
> > why KVM checks the return. But my vote is to drop the comment, because it
> > essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> > makes the reader go "well, yeah".
>
> So IIUC you're suggesting
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 426ad49325d7..c39eaeb21981 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2068,16 +2068,11 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> svm->vmcb->control.insn_bytes : NULL,
> svm->vmcb->control.insn_len);
> + if (rc <= 0)
> + return rc;
> - /*
> - * rc == 0 indicates a userspace exit is needed to handle page
> - * transitions, so do that first before updating the RMP table.
> - */
> - if (error_code & PFERR_GUEST_RMP_MASK) {
> - if (rc == 0)
> - return rc;
> + if (error_code & PFERR_GUEST_RMP_MASK)
> sev_handle_rmp_fault(vcpu, fault_address, error_code);
> - }
> return rc;
> }
>
> ?
>
> So, we're... a bit tight for 6.10 to include SNP and that is an
> understatement. My plan is to merge it for 6.11, but do so
> immediately after the merge window ends. In other words, it
> is a delay in terms of release but not in terms of time. I
> don't want QEMU and kvm-unit-tests work to be delayed any
> further, in particular.
That's unfortunate, I'd thought from the PUCK call that we still had
some time to stabilize things before merge window. But whatever you
think is best.
>
> Once we sort out the loose ends of patches 21-23, you could send
> it as a pull request.
Ok, as a pull request against kvm/next, or kvm/queue?
Thanks,
Mike
>
> Paolo
>
>
On 5/10/24 18:37, Michael Roth wrote:
>> So, we're... a bit tight for 6.10 to include SNP and that is an
>> understatement. My plan is to merge it for 6.11, but do so
>> immediately after the merge window ends. In other words, it
>> is a delay in terms of release but not in terms of time. I
>> don't want QEMU and kvm-unit-tests work to be delayed any
>> further, in particular.
>
> That's unfortunate, I'd thought from the PUCK call that we still had
> some time to stabilize things before merge window. But whatever you
> think is best.

Well, the merge window starts next sunday, doesn't it? If there's an -rc8
I agree there's some leeway, but that is not too likely.

>> Once we sort out the loose ends of patches 21-23, you could send
>> it as a pull request.
>
> Ok, as a pull request against kvm/next, or kvm/queue?

Against kvm/next.

Paolo
On May 10, 2024 6:59:37 PM GMT+02:00, Paolo Bonzini <pbonzini@redhat.com> wrote:
>Well, the merge window starts next sunday, doesn't it? If there's an -rc8
>I agree there's some leeway, but that is not too likely.

Nah, the merge window just opened yesterday.

--
Sent from a small device: formatting sucks and brevity is inevitable.
On Fri, May 10, 2024 at 6:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> Well, the merge window starts next sunday, doesn't it? If there's an
> -rc8 I agree there's some leeway, but that is not too likely.
>
> >> Once we sort out the loose ends of patches 21-23, you could send
> >> it as a pull request.
> > Ok, as a pull request against kvm/next, or kvm/queue?
>
> Against kvm/next.

Ah no, only kvm/queue has the preparatory hooks - they make no sense
without something that uses them. kvm/queue is ready now.

Also, please send the pull request "QEMU style", i.e. with patches as
replies. If there's an -rc8, I'll probably pull it on Thursday morning.

Paolo
On Fri, May 10, 2024 at 06:58:45AM -0700, Sean Christopherson wrote:
> On Thu, May 09, 2024, Michael Roth wrote:
> > The intended logic when handling #NPFs with the RMP bit set (31) is to
> > first check to see if the #NPF requires a shared<->private transition
> > and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> > get forwarded on to userspace before proceeding with any handling of
> > other potential RMP fault conditions like needing to PSMASH the RMP
> > entry/etc (which will be done later if the guest still re-faults after
> > the KVM_EXIT_MEMORY_FAULT is processed by userspace).
> >
> > The determination of whether any userspace handling of
> > KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> > of kvm_mmu_page_fault(). However, the current code misinterprets the
> > return code, expecting 0 to indicate a userspace exit rather than less
> > than 0 (-EFAULT). This leads to the following unexpected behavior:
> >
> > - for KVM_EXIT_MEMORY_FAULTs resulting for implicit shared->private
> > conversions, warnings get printed from sev_handle_rmp_fault()
> > because it does not expect to be called for GPAs where
> > KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard linux guests don't
> > generally do this, but it is allowed and should be handled
> > similarly to private->shared conversions rather than triggering any
> > sort of warnings
> >
> > - if gmem support for 2MB folios is enabled (via currently out-of-tree
> > code), implicit shared<->private conversions will always result in
> > a PSMASH being attempted, even if it's not actually needed to
> > resolve the RMP fault. This doesn't cause any harm, but results in a
> > needless PSMASH and zapping of the sPTE
> >
> > Resolve these issues by calling sev_handle_rmp_fault() only when
> > kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> > indicating a KVM_MEMORY_EXIT_FAULT/-EFAULT isn't needed. While here,
> > simplify the code slightly and fix up the associated comments for better
> > clarity.
> >
> > Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
> >
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > ---
> > arch/x86/kvm/svm/svm.c | 10 ++++------
> > 1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 426ad49325d7..9431ce74c7d4 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> > svm->vmcb->control.insn_len);
> >
> > /*
> > - * rc == 0 indicates a userspace exit is needed to handle page
> > - * transitions, so do that first before updating the RMP table.
> > + * rc < 0 indicates a userspace exit may be needed to handle page
> > + * attribute updates, so deal with that first before handling other
> > + * potential RMP fault conditions.
> > */
> > - if (error_code & PFERR_GUEST_RMP_MASK) {
> > - if (rc == 0)
> > - return rc;
> > + if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
>
> This isn't correct either. A return of '0' also indiciates "exit to userspace",
> it just doesn't happen with SNP because '0' is returned only when KVM attempts
> emulation, and that too gets short-circuited by svm_check_emulate_instruction().
>
> And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> values overload is ubiquitous enough that it should be relatively self-explanatory.
>
> Or if you prefer to keep a comment, drop the part that specifically calls out
> attributes updates, because that incorrectly implies that's the _only_ reason
> why KVM checks the return. But my vote is to drop the comment, because it
> essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> makes the reader go "well, yeah".
Ok, I think I was just paranoid after missing this. I've gone ahead and
dropped the comment, and hopefully it's now drilled into my head enough
that it's obvious to me now as well :) I've also changed the logic to
skip the extra RMP handling for rc==0 as well (should that ever arise
for any future reason):
https://github.com/mdroth/linux/commit/0a0ba0d7f7571a31f0abc68acc51f24c2a14a8cf
Thanks!
-Mike
There are a few edge-cases that the current processing for GHCB PSC
requests doesn't handle properly:
- KVM properly ignores SMASH/UNSMASH ops when they are embedded in a
PSC request buffer which contains other PSC operations, but
inadvertently forwards them to userspace as private->shared PSC
requests if they appear at the end of the buffer. Make sure these are
ignored instead, just like cases where they are not at the end of the
request buffer.
- Current code handles non-zero 'cur_page' fields when they are at the
beginning of a new GPA range, but does not handle them properly when
iterating through subsequent entries that are otherwise part of a
contiguous range. Fix up the handling so that these entries are not
combined into a larger contiguous range that includes unintended GPA
ranges and are instead processed later as the start of a new
contiguous range.
- The page size variable used to track 2M entries in KVM for inflight PSCs
might be artificially set to a different value, which can lead to
unexpected values in the entry's final 'cur_page' update. Use the
entry's 'pagesize' field instead to determine what the value of
'cur_page' should be upon completion of processing.
While here, also add a small helper for clearing in-flight PSC
variables and fix up comments for better readability.
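For reference, the 'cur_page', 'pagesize', and 'operation' fields mentioned
above live in the GHCB page state change structures, roughly as defined in
asm/sev-common.h (reproduced here for context; consult the header/GHCB spec for
the authoritative layout):

	struct psc_hdr {
		u16 cur_entry;
		u16 end_entry;
		u32 reserved;
	} __packed;

	struct psc_entry {
		u64 cur_page	: 12,
		    gfn		: 40,
		    operation	: 4,
		    pagesize	: 1,
		    reserved	: 7;
	} __packed;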
Fixes: 266205d810d2 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
arch/x86/kvm/svm/sev.c | 73 +++++++++++++++++++++++++++---------------
1 file changed, 47 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 35f0bd91f92e..ab23329e2bd0 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3555,43 +3555,50 @@ struct psc_buffer {
static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);
-static int snp_complete_psc(struct kvm_vcpu *vcpu)
+static void snp_reset_inflight_psc(struct vcpu_svm *svm)
+{
+ svm->sev_es.psc_idx = 0;
+ svm->sev_es.psc_inflight = 0;
+ svm->sev_es.psc_2m = false;
+}
+
+static void __snp_complete_psc(struct vcpu_svm *svm)
{
- struct vcpu_svm *svm = to_svm(vcpu);
struct psc_buffer *psc = svm->sev_es.ghcb_sa;
struct psc_entry *entries = psc->entries;
struct psc_hdr *hdr = &psc->hdr;
- __u64 psc_ret;
__u16 idx;
- if (vcpu->run->hypercall.ret) {
- psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
- goto out_resume;
- }
-
/*
* Everything in-flight has been processed successfully. Update the
- * corresponding entries in the guest's PSC buffer.
+ * corresponding entries in the guest's PSC buffer and zero out the
+ * count of in-flight PSC entries.
*/
for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
svm->sev_es.psc_inflight--, idx++) {
struct psc_entry *entry = &entries[idx];
- entry->cur_page = svm->sev_es.psc_2m ? 512 : 1;
+ entry->cur_page = entry->pagesize ? 512 : 1;
}
hdr->cur_entry = idx;
+}
- /* Handle the next range (if any). */
- return snp_begin_psc(svm, psc);
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct psc_buffer *psc = svm->sev_es.ghcb_sa;
-out_resume:
- svm->sev_es.psc_idx = 0;
- svm->sev_es.psc_inflight = 0;
- svm->sev_es.psc_2m = false;
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
+ if (vcpu->run->hypercall.ret) {
+ snp_reset_inflight_psc(svm);
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1; /* resume guest */
+ }
- return 1; /* resume guest */
+ __snp_complete_psc(svm);
+
+ /* Handle the next range (if any). */
+ return snp_begin_psc(svm, psc);
}
static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
@@ -3634,6 +3641,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
goto out_resume;
}
+next_range:
/* Find the start of the next range which needs processing. */
for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
__u16 cur_page;
@@ -3642,11 +3650,6 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
entry_start = entries[idx];
- /* Only private/shared conversions are currently supported. */
- if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
- entry_start.operation != VMGEXIT_PSC_OP_SHARED)
- continue;
-
gfn = entry_start.gfn;
cur_page = entry_start.cur_page;
huge = entry_start.pagesize;
@@ -3687,6 +3690,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
if (entry.operation != entry_start.operation ||
entry.gfn != entry_start.gfn + npages ||
+ entry.cur_page != 0 ||
!!entry.pagesize != svm->sev_es.psc_2m)
break;
@@ -3694,6 +3698,25 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
npages += entry_start.pagesize ? 512 : 1;
}
+ /*
+ * Only shared/private PSC operations are currently supported, so if the
+ * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
+ * then consider the entire range completed and avoid exiting to
+ * userspace. In theory snp_complete_psc() can always be called directly
+ * at this point to complete the current range and start the next one,
+ * but that could lead to unexpected levels of recursion, so only do
+ * that if there are no more entries to process and the entire request
+ * has been completed.
+ */
+ if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
+ entry_start.operation != VMGEXIT_PSC_OP_SHARED) {
+ if (idx > idx_end)
+ return snp_complete_psc(vcpu);
+
+ __snp_complete_psc(svm);
+ goto next_range;
+ }
+
vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
vcpu->run->hypercall.args[0] = gpa;
@@ -3709,9 +3732,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
return 0; /* forward request to userspace */
out_resume:
- svm->sev_es.psc_idx = 0;
- svm->sev_es.psc_inflight = 0;
- svm->sev_es.psc_2m = false;
+ snp_reset_inflight_psc(svm);
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
return 1; /* resume guest */
--
2.25.1
On 5/10/24 03:58, Michael Roth wrote:
> There are a few edge-cases that the current processing for GHCB PSC
> requests doesn't handle properly:
>
> - KVM properly ignores SMASH/UNSMASH ops when they are embedded in a
> PSC request buffer which contains other PSC operations, but
> inadvertantly forwards them to userspace as private->shared PSC
> requests if they appear at the end of the buffer. Make sure these are
> ignored instead, just like cases where they are not at the end of the
> request buffer.
>
> - Current code handles non-zero 'cur_page' fields when they are at the
> beginning of a new GPA range, but it is not handling properly when
> iterating through subsequent entries which are otherwise part of a
> contiguous range. Fix up the handling so that these entries are not
> combined into a larger contiguous range that include unintended GPA
> ranges and are instead processed later as the start of a new
> contiguous range.
>
> - The page size variable used to track 2M entries in KVM for inflight PSCs
> might be artifically set to a different value, which can lead to
> unexpected values in the entry's final 'cur_page' update. Use the
> entry's 'pagesize' field instead to determine what the value of
> 'cur_page' should be upon completion of processing.
>
> While here, also add a small helper for clearing in-flight PSCs
> variables and fix up comments for better readability.
>
> Fixes: 266205d810d2 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
> Signed-off-by: Michael Roth <michael.roth@amd.com>
There are some more improvements that can be made to the readability of
the code... this one is already better than what the patch is fixing up, but I
don't like the code that is in the loop even though it is unconditionally
followed by "break".
Here's my attempt at replacing this patch, which is really more of a
rewrite of the whole function... Untested beyond compilation.
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 35f0bd91f92e..6e612789c35f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3555,23 +3555,25 @@ struct psc_buffer {
static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);
-static int snp_complete_psc(struct kvm_vcpu *vcpu)
+static void snp_complete_psc(struct vcpu_svm *svm, u64 psc_ret)
+{
+ svm->sev_es.psc_inflight = 0;
+ svm->sev_es.psc_idx = 0;
+ svm->sev_es.psc_2m = 0;
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
+}
+
+static void __snp_complete_one_psc(struct vcpu_svm *svm)
{
- struct vcpu_svm *svm = to_svm(vcpu);
struct psc_buffer *psc = svm->sev_es.ghcb_sa;
struct psc_entry *entries = psc->entries;
struct psc_hdr *hdr = &psc->hdr;
- __u64 psc_ret;
__u16 idx;
- if (vcpu->run->hypercall.ret) {
- psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
- goto out_resume;
- }
-
/*
* Everything in-flight has been processed successfully. Update the
- * corresponding entries in the guest's PSC buffer.
+ * corresponding entries in the guest's PSC buffer and zero out the
+ * count of in-flight PSC entries.
*/
for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
svm->sev_es.psc_inflight--, idx++) {
@@ -3581,17 +3583,22 @@ static int snp_complete_psc(struct kvm_vcpu *vcpu)
}
hdr->cur_entry = idx;
+}
+
+static int snp_complete_one_psc(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct psc_buffer *psc = svm->sev_es.ghcb_sa;
+
+ if (vcpu->run->hypercall.ret) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1; /* resume guest */
+ }
+
+ __snp_complete_one_psc(svm);
/* Handle the next range (if any). */
return snp_begin_psc(svm, psc);
-
-out_resume:
- svm->sev_es.psc_idx = 0;
- svm->sev_es.psc_inflight = 0;
- svm->sev_es.psc_2m = false;
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
-
- return 1; /* resume guest */
}
static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
@@ -3601,18 +3608,20 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
struct psc_hdr *hdr = &psc->hdr;
struct psc_entry entry_start;
u16 idx, idx_start, idx_end;
- __u64 psc_ret, gpa;
+ u64 gfn;
int npages;
-
- /* There should be no other PSCs in-flight at this point. */
- if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
- psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
- goto out_resume;
- }
+ bool huge;
if (!(vcpu->kvm->arch.hypercall_exit_enabled & (1 << KVM_HC_MAP_GPA_RANGE))) {
- psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
- goto out_resume;
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1;
+ }
+
+next_range:
+ /* There should be no other PSCs in-flight at this point. */
+ if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1;
}
/*
@@ -3624,97 +3633,99 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
idx_end = hdr->end_entry;
if (idx_end >= VMGEXIT_PSC_MAX_COUNT) {
- psc_ret = VMGEXIT_PSC_ERROR_INVALID_HDR;
- goto out_resume;
- }
-
- /* Nothing more to process. */
- if (idx_start > idx_end) {
- psc_ret = 0;
- goto out_resume;
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_HDR);
+ return 1;
}
/* Find the start of the next range which needs processing. */
for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
- __u16 cur_page;
- gfn_t gfn;
- bool huge;
-
entry_start = entries[idx];
-
- /* Only private/shared conversions are currently supported. */
- if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
- entry_start.operation != VMGEXIT_PSC_OP_SHARED)
- continue;
-
gfn = entry_start.gfn;
- cur_page = entry_start.cur_page;
huge = entry_start.pagesize;
+ npages = huge ? 512 : 1;
- if ((huge && (cur_page > 512 || !IS_ALIGNED(gfn, 512))) ||
- (!huge && cur_page > 1)) {
- psc_ret = VMGEXIT_PSC_ERROR_INVALID_ENTRY;
- goto out_resume;
+ if (entry_start.cur_page > npages || !IS_ALIGNED(gfn, npages)) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_ENTRY);
+ return 1;
}
+ if (entry_start.cur_page) {
+ /*
+ * If this is a partially-completed 2M range, force 4K
+ * handling for the remaining pages since they're effectively
+ * split at this point. Subsequent code should ensure this
+ * doesn't get combined with adjacent PSC entries where 2M
+ * handling is still possible.
+ */
+ npages -= entry_start.cur_page;
+ gfn += entry_start.cur_page;
+ huge = false;
+ }
+ if (npages)
+ break;
+
/* All sub-pages already processed. */
- if ((huge && cur_page == 512) || (!huge && cur_page == 1))
- continue;
-
- /*
- * If this is a partially-completed 2M range, force 4K handling
- * for the remaining pages since they're effectively split at
- * this point. Subsequent code should ensure this doesn't get
- * combined with adjacent PSC entries where 2M handling is still
- * possible.
- */
- svm->sev_es.psc_2m = cur_page ? false : huge;
- svm->sev_es.psc_idx = idx;
- svm->sev_es.psc_inflight = 1;
-
- gpa = gfn_to_gpa(gfn + cur_page);
- npages = huge ? 512 - cur_page : 1;
- break;
}
+ if (idx > idx_end) {
+ /* Nothing more to process. */
+ snp_complete_psc(svm, 0);
+ return 1;
+ }
+
+ svm->sev_es.psc_2m = huge;
+ svm->sev_es.psc_idx = idx;
+ svm->sev_es.psc_inflight = 1;
+
/*
* Find all subsequent PSC entries that contain adjacent GPA
* ranges/operations and can be combined into a single
* KVM_HC_MAP_GPA_RANGE exit.
*/
- for (idx = svm->sev_es.psc_idx + 1; idx <= idx_end; idx++) {
+ while (++idx <= idx_end) {
struct psc_entry entry = entries[idx];
if (entry.operation != entry_start.operation ||
- entry.gfn != entry_start.gfn + npages ||
- !!entry.pagesize != svm->sev_es.psc_2m)
+ entry.gfn != gfn + npages ||
+ entry.cur_page ||
+ !!entry.pagesize != huge)
break;
svm->sev_es.psc_inflight++;
- npages += entry_start.pagesize ? 512 : 1;
+ npages += huge ? 512 : 1;
}
- vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
- vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
- vcpu->run->hypercall.args[0] = gpa;
- vcpu->run->hypercall.args[1] = npages;
- vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
- ? KVM_MAP_GPA_RANGE_ENCRYPTED
- : KVM_MAP_GPA_RANGE_DECRYPTED;
- vcpu->run->hypercall.args[2] |= entry_start.pagesize
- ? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
- : KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
- vcpu->arch.complete_userspace_io = snp_complete_psc;
+ switch (entry_start.operation) {
+ case VMGEXIT_PSC_OP_PRIVATE:
+ case VMGEXIT_PSC_OP_SHARED:
+ vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
+ vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+ vcpu->run->hypercall.args[0] = gfn_to_gpa(gfn);
+ vcpu->run->hypercall.args[1] = npages;
+ vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
+ ? KVM_MAP_GPA_RANGE_ENCRYPTED
+ : KVM_MAP_GPA_RANGE_DECRYPTED;
+ vcpu->run->hypercall.args[2] |= huge
+ ? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
+ : KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
+ vcpu->arch.complete_userspace_io = snp_complete_one_psc;
- return 0; /* forward request to userspace */
+ return 0; /* forward request to userspace */
-out_resume:
- svm->sev_es.psc_idx = 0;
- svm->sev_es.psc_inflight = 0;
- svm->sev_es.psc_2m = false;
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
+ default:
+ /*
+ * Only shared/private PSC operations are currently supported, so if the
+ * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
+ * then consider the entire range completed and avoid exiting to
+ * userspace. In theory snp_complete_psc() can be called directly
+ * at this point to complete the current range and start the next one,
+ * but that could lead to unexpected levels of recursion.
+ */
+ __snp_complete_one_psc(svm);
+ goto next_range;
+ }
- return 1; /* resume guest */
+ unreachable();
}
static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
On Fri, May 10, 2024 at 07:09:07PM +0200, Paolo Bonzini wrote:
> On 5/10/24 03:58, Michael Roth wrote:
> > There are a few edge-cases that the current processing for GHCB PSC
> > requests doesn't handle properly:
> >
> > - KVM properly ignores SMASH/UNSMASH ops when they are embedded in a
> > PSC request buffer which contains other PSC operations, but
> > inadvertantly forwards them to userspace as private->shared PSC
> > requests if they appear at the end of the buffer. Make sure these are
> > ignored instead, just like cases where they are not at the end of the
> > request buffer.
> >
> > - Current code handles non-zero 'cur_page' fields when they are at the
> > beginning of a new GPA range, but it is not handling properly when
> > iterating through subsequent entries which are otherwise part of a
> > contiguous range. Fix up the handling so that these entries are not
> > combined into a larger contiguous range that include unintended GPA
> > ranges and are instead processed later as the start of a new
> > contiguous range.
> >
> > - The page size variable used to track 2M entries in KVM for inflight PSCs
> > might be artifically set to a different value, which can lead to
> > unexpected values in the entry's final 'cur_page' update. Use the
> > entry's 'pagesize' field instead to determine what the value of
> > 'cur_page' should be upon completion of processing.
> >
> > While here, also add a small helper for clearing in-flight PSCs
> > variables and fix up comments for better readability.
> >
> > Fixes: 266205d810d2 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>
> There are some more improvements that can be made to the readability of
> the code... this one is already better than the patch is fixing up, but I
> don't like the code that is in the loop even though it is unconditionally
> followed by "break".
>
> Here's my attempt at replacing this patch, which is really more of a
> rewrite of the whole function... Untested beyond compilation.
Thanks for the suggested rework. I tested with/without 2MB pages and
everything worked as-written. This is the full/squashed patch I plan to
include in the pull request:
https://github.com/mdroth/linux/commit/91f6d31c4dfc88dd1ac378e2db6117b0c982e63c
-Mike
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 35f0bd91f92e..6e612789c35f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3555,23 +3555,25 @@ struct psc_buffer {
> static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);
> -static int snp_complete_psc(struct kvm_vcpu *vcpu)
> +static void snp_complete_psc(struct vcpu_svm *svm, u64 psc_ret)
> +{
> + svm->sev_es.psc_inflight = 0;
> + svm->sev_es.psc_idx = 0;
> + svm->sev_es.psc_2m = 0;
> + ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> +}
> +
> +static void __snp_complete_one_psc(struct vcpu_svm *svm)
> {
> - struct vcpu_svm *svm = to_svm(vcpu);
> struct psc_buffer *psc = svm->sev_es.ghcb_sa;
> struct psc_entry *entries = psc->entries;
> struct psc_hdr *hdr = &psc->hdr;
> - __u64 psc_ret;
> __u16 idx;
> - if (vcpu->run->hypercall.ret) {
> - psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> - goto out_resume;
> - }
> -
> /*
> * Everything in-flight has been processed successfully. Update the
> - * corresponding entries in the guest's PSC buffer.
> + * corresponding entries in the guest's PSC buffer and zero out the
> + * count of in-flight PSC entries.
> */
> for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
> svm->sev_es.psc_inflight--, idx++) {
> @@ -3581,17 +3583,22 @@ static int snp_complete_psc(struct kvm_vcpu *vcpu)
> }
> hdr->cur_entry = idx;
> +}
> +
> +static int snp_complete_one_psc(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> + struct psc_buffer *psc = svm->sev_es.ghcb_sa;
> +
> + if (vcpu->run->hypercall.ret) {
> + snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> + return 1; /* resume guest */
> + }
> +
> + __snp_complete_one_psc(svm);
> /* Handle the next range (if any). */
> return snp_begin_psc(svm, psc);
> -
> -out_resume:
> - svm->sev_es.psc_idx = 0;
> - svm->sev_es.psc_inflight = 0;
> - svm->sev_es.psc_2m = false;
> - ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> -
> - return 1; /* resume guest */
> }
> static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
> @@ -3601,18 +3608,20 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
> struct psc_hdr *hdr = &psc->hdr;
> struct psc_entry entry_start;
> u16 idx, idx_start, idx_end;
> - __u64 psc_ret, gpa;
> + u64 gfn;
> int npages;
> -
> - /* There should be no other PSCs in-flight at this point. */
> - if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
> - psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> - goto out_resume;
> - }
> + bool huge;
> if (!(vcpu->kvm->arch.hypercall_exit_enabled & (1 << KVM_HC_MAP_GPA_RANGE))) {
> - psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> - goto out_resume;
> + snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> + return 1;
> + }
> +
> +next_range:
> + /* There should be no other PSCs in-flight at this point. */
> + if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
> + snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> + return 1;
> }
> /*
> @@ -3624,97 +3633,99 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
> idx_end = hdr->end_entry;
> if (idx_end >= VMGEXIT_PSC_MAX_COUNT) {
> - psc_ret = VMGEXIT_PSC_ERROR_INVALID_HDR;
> - goto out_resume;
> - }
> -
> - /* Nothing more to process. */
> - if (idx_start > idx_end) {
> - psc_ret = 0;
> - goto out_resume;
> + snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_HDR);
> + return 1;
> }
> /* Find the start of the next range which needs processing. */
> for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
> - __u16 cur_page;
> - gfn_t gfn;
> - bool huge;
> -
> entry_start = entries[idx];
> -
> - /* Only private/shared conversions are currently supported. */
> - if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
> - entry_start.operation != VMGEXIT_PSC_OP_SHARED)
> - continue;
> -
> gfn = entry_start.gfn;
> - cur_page = entry_start.cur_page;
> huge = entry_start.pagesize;
> + npages = huge ? 512 : 1;
> - if ((huge && (cur_page > 512 || !IS_ALIGNED(gfn, 512))) ||
> - (!huge && cur_page > 1)) {
> - psc_ret = VMGEXIT_PSC_ERROR_INVALID_ENTRY;
> - goto out_resume;
> + if (entry_start.cur_page > npages || !IS_ALIGNED(gfn, npages)) {
> + snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_ENTRY);
> + return 1;
> }
> + if (entry_start.cur_page) {
> + /*
> + * If this is a partially-completed 2M range, force 4K
> + * handling for the remaining pages since they're effectively
> + * split at this point. Subsequent code should ensure this
> + * doesn't get combined with adjacent PSC entries where 2M
> + * handling is still possible.
> + */
> + npages -= entry_start.cur_page;
> + gfn += entry_start.cur_page;
> + huge = false;
> + }
> + if (npages)
> + break;
> +
> /* All sub-pages already processed. */
> - if ((huge && cur_page == 512) || (!huge && cur_page == 1))
> - continue;
> -
> - /*
> - * If this is a partially-completed 2M range, force 4K handling
> - * for the remaining pages since they're effectively split at
> - * this point. Subsequent code should ensure this doesn't get
> - * combined with adjacent PSC entries where 2M handling is still
> - * possible.
> - */
> - svm->sev_es.psc_2m = cur_page ? false : huge;
> - svm->sev_es.psc_idx = idx;
> - svm->sev_es.psc_inflight = 1;
> -
> - gpa = gfn_to_gpa(gfn + cur_page);
> - npages = huge ? 512 - cur_page : 1;
> - break;
> }
> + if (idx > idx_end) {
> + /* Nothing more to process. */
> + snp_complete_psc(svm, 0);
> + return 1;
> + }
> +
> + svm->sev_es.psc_2m = huge;
> + svm->sev_es.psc_idx = idx;
> + svm->sev_es.psc_inflight = 1;
> +
> /*
> * Find all subsequent PSC entries that contain adjacent GPA
> * ranges/operations and can be combined into a single
> * KVM_HC_MAP_GPA_RANGE exit.
> */
> - for (idx = svm->sev_es.psc_idx + 1; idx <= idx_end; idx++) {
> + while (++idx <= idx_end) {
> struct psc_entry entry = entries[idx];
> if (entry.operation != entry_start.operation ||
> - entry.gfn != entry_start.gfn + npages ||
> - !!entry.pagesize != svm->sev_es.psc_2m)
> + entry.gfn != gfn + npages ||
> + entry.cur_page ||
> + !!entry.pagesize != huge)
> break;
> svm->sev_es.psc_inflight++;
> - npages += entry_start.pagesize ? 512 : 1;
> + npages += huge ? 512 : 1;
> }
> - vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
> - vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> - vcpu->run->hypercall.args[0] = gpa;
> - vcpu->run->hypercall.args[1] = npages;
> - vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
> - ? KVM_MAP_GPA_RANGE_ENCRYPTED
> - : KVM_MAP_GPA_RANGE_DECRYPTED;
> - vcpu->run->hypercall.args[2] |= entry_start.pagesize
> - ? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
> - : KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
> - vcpu->arch.complete_userspace_io = snp_complete_psc;
> + switch (entry_start.operation) {
> + case VMGEXIT_PSC_OP_PRIVATE:
> + case VMGEXIT_PSC_OP_SHARED:
> + vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
> + vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> + vcpu->run->hypercall.args[0] = gfn_to_gpa(gfn);
> + vcpu->run->hypercall.args[1] = npages;
> + vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
> + ? KVM_MAP_GPA_RANGE_ENCRYPTED
> + : KVM_MAP_GPA_RANGE_DECRYPTED;
> + vcpu->run->hypercall.args[2] |= huge
> + ? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
> + : KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
> + vcpu->arch.complete_userspace_io = snp_complete_one_psc;
> - return 0; /* forward request to userspace */
> + return 0; /* forward request to userspace */
> -out_resume:
> - svm->sev_es.psc_idx = 0;
> - svm->sev_es.psc_inflight = 0;
> - svm->sev_es.psc_2m = false;
> - ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> + default:
> + /*
> + * Only shared/private PSC operations are currently supported, so if the
> + * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
> + * then consider the entire range completed and avoid exiting to
> + * userspace. In theory snp_complete_one_psc() can be called directly
> + * at this point to complete the current range and start the next one,
> + * but that could lead to unexpected levels of recursion.
> + */
> + __snp_complete_one_psc(svm);
> + goto next_range;
> + }
> - return 1; /* resume guest */
> + unreachable();
> }
> static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
>
>
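
For completeness, the userspace side of the forwarded exit looks roughly like the sketch below. This is a hypothetical VMM handler, not code from any particular VMM: vm_fd and handle_map_gpa_range() are assumed names, the constants come from the KVM UAPI headers, and error handling is reduced to setting hypercall.ret so that snp_complete_one_psc() reports VMGEXIT_PSC_ERROR_GENERIC back to the guest on failure.

/*
 * Hypothetical VMM-side handler for the KVM_EXIT_HYPERCALL /
 * KVM_HC_MAP_GPA_RANGE exit produced by snp_begin_psc().  args[0..2] are
 * packed as in the diff above; vm_fd and the surrounding run loop are
 * assumed to exist elsewhere in the VMM.
 */
#include <linux/kvm.h>
#include <linux/kvm_para.h>
#include <stdint.h>
#include <sys/ioctl.h>

static void handle_map_gpa_range(int vm_fd, struct kvm_run *run)
{
        uint64_t gpa    = run->hypercall.args[0];
        uint64_t npages = run->hypercall.args[1];       /* in 4K units */
        uint64_t flags  = run->hypercall.args[2];
        struct kvm_memory_attributes attrs = {
                .address    = gpa,
                .size       = npages * 4096,
                .attributes = (flags & KVM_MAP_GPA_RANGE_ENCRYPTED) ?
                              KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
        };

        /*
         * Flip the shared/private attribute for the whole range; the 2M/4K
         * hint in args[2] does not change the attribute update itself.
         */
        run->hypercall.ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) ?
                             1 /* -> VMGEXIT_PSC_ERROR_GENERIC in the guest */ : 0;
}

On the next KVM_RUN the complete_userspace_io callback (snp_complete_one_psc()) checks hypercall.ret, updates the guest's PSC buffer via __snp_complete_one_psc(), and starts the next range, matching the flow in the diff above.
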