From: Michael Roth <michael.roth@amd.com>

All gmem pages are expected to be 'private' as defined by a particular
arch/platform. Platforms like SEV-SNP require additional operations to
move these pages into a private state, so implement a hook that can be
used to prepare this memory prior to mapping it into a guest.

In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table, so this hook will also be used by the KVM MMU to clamp
the maximum mapping size accordingly.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/r/20230612042559.375660-2-michael.roth@amd.com
---
Changes v2 -> v3:
- Newly added
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 +++
 arch/x86/kvm/mmu/mmu.c             | 12 ++++++++++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index c0143906fe6d..a4cb248519cf 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e4f2938bb1fc..de7f0dffa135 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1735,6 +1735,9 @@ struct kvm_x86_ops {
 	 * Returns vCPU specific APICv inhibit reasons
 	 */
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+	int (*gmem_prepare)(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a73ddb43a2cf..35bb14363828 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4352,6 +4352,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 				   struct kvm_page_fault *fault)
 {
 	int max_order, r;
+	u8 max_level;
 
 	if (!kvm_slot_can_be_private(fault->slot))
 		return kvm_do_memory_fault_exit(vcpu, fault);
@@ -4361,8 +4362,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	if (r)
 		return r;
 
-	fault->max_level = min(kvm_max_level_for_order(max_order),
-			       fault->max_level);
+	max_level = kvm_max_level_for_order(max_order);
+	r = static_call(kvm_x86_gmem_prepare)(vcpu->kvm, fault->slot, fault->pfn,
+					      fault->gfn, &max_level);
+	if (r) {
+		kvm_release_pfn_clean(fault->pfn);
+		return r;
+	}
+
+	fault->max_level = min(max_level, fault->max_level);
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
 	return RET_PF_CONTINUE;
 }
-- 
2.25.1
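
(Editorial aside, not part of the posted patch: a rough sketch of the shape an SNP-style gmem_prepare implementation could take, inferred only from the commit message above. rmp_2mb_range_has_private() and rmp_make_page_private() are hypothetical placeholders, not functions from this or any posted series.)

/*
 * Hypothetical placeholders for the platform's RMP primitives:
 *   bool rmp_2mb_range_has_private(kvm_pfn_t pfn_2mb_aligned);
 *   int rmp_make_page_private(kvm_pfn_t pfn, gpa_t gpa, u8 level);
 */
static int snp_gmem_prepare(struct kvm *kvm, struct kvm_memory_slot *slot,
			    kvm_pfn_t pfn, gfn_t gfn, u8 *max_level)
{
	gpa_t gpa = gfn_to_gpa(gfn);
	int rc;

	/*
	 * A 2MB NPT mapping is only safe if no 4K sub-page of the 2MB
	 * region has already been made private in the RMP table, so clamp
	 * the mapping level before the MMU installs the SPTE.
	 */
	if (*max_level > PG_LEVEL_4K &&
	    rmp_2mb_range_has_private(ALIGN_DOWN(pfn, PTRS_PER_PMD)))
		*max_level = PG_LEVEL_4K;

	/* Move the page into the private state the guest expects. */
	rc = rmp_make_page_private(pfn, gpa, *max_level);
	if (rc)
		pr_err_ratelimited("gmem_prepare: RMP update failed, pfn 0x%llx gpa 0x%llx rc %d\n",
				   pfn, gpa, rc);

	return rc;
}

The hook signature matches what the patch adds to kvm_x86_ops; the body above is only an illustration of how the RMP update and the max_level clamp could be combined in one callback.
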

On Thu, Jul 20, 2023, isaku.yamahata@intel.com wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a73ddb43a2cf..35bb14363828 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4352,6 +4352,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
>  				   struct kvm_page_fault *fault)
>  {
>  	int max_order, r;
> +	u8 max_level;
>
>  	if (!kvm_slot_can_be_private(fault->slot))
>  		return kvm_do_memory_fault_exit(vcpu, fault);
> @@ -4361,8 +4362,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
>  	if (r)
>  		return r;
>
> -	fault->max_level = min(kvm_max_level_for_order(max_order),
> -			       fault->max_level);
> +	max_level = kvm_max_level_for_order(max_order);
> +	r = static_call(kvm_x86_gmem_prepare)(vcpu->kvm, fault->slot, fault->pfn,
> +					      fault->gfn, &max_level);

I think KVM should hook gmem itself, i.e. convert pages to private when they're allocated, and then back to shared when they're freed. I don't like having asymmetric APIs (convert on fault instead of allocate, but then convert back on free instead of on zap), and hooking the page fault path also violates the approach of tying the RMP/HKID to the physical allocation.

Practically speaking, hooking the fault path will result in undesirable behavior. Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest shares a single 4KiB page in a 2MiB chunk. Dirty logging is another case where the RMP can stay at 2MiB. Or as a very silly example, imagine userspace pulls a stupid and covers a single 2MiB chunk of gmem with 512 memslots.

That likely means KVM will need an additional hook to clamp the max_level at the RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM should be able to blindly do the conversion because it would be a kernel bug if the page is already assigned to an ASID in the RMP, i.e. the additional hook shouldn't incur an extra RMP lookup.

On Fri, Jul 21, 2023 at 07:25:58AM -0700, Sean Christopherson wrote:
> On Thu, Jul 20, 2023, isaku.yamahata@intel.com wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a73ddb43a2cf..35bb14363828 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4352,6 +4352,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> >  				   struct kvm_page_fault *fault)
> >  {
> >  	int max_order, r;
> > +	u8 max_level;
> >
> >  	if (!kvm_slot_can_be_private(fault->slot))
> >  		return kvm_do_memory_fault_exit(vcpu, fault);
> > @@ -4361,8 +4362,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> >  	if (r)
> >  		return r;
> >
> > -	fault->max_level = min(kvm_max_level_for_order(max_order),
> > -			       fault->max_level);
> > +	max_level = kvm_max_level_for_order(max_order);
> > +	r = static_call(kvm_x86_gmem_prepare)(vcpu->kvm, fault->slot, fault->pfn,
> > +					      fault->gfn, &max_level);
>
> I think KVM should hook gmem itself, i.e. convert pages to private when they're allocated, and then back to shared when they're freed. I don't like having asymmetric APIs (convert on fault instead of allocate, but then convert back on free instead of on zap), and hooking the page fault path also violates the approach of tying the RMP/HKID to the physical allocation.

I'd originally added an arch hook in kvm_gmem_get_pfn() to handle RMP/HKID when the folio is allocated:

https://github.com/mdroth/linux/commit/adf78d224126f31e9096b80be21619e1ba447304

Based on your response it seemed like there was a preference to either:

a) do the RMP/HKID update via KVM MMU prior to mapping into the guest
b) implement an ioctl() to let userspace pre-allocate/pre-prepare and do the RMP updates in advance.

https://lore.kernel.org/lkml/20230522135036.wnvsmryhkvstwvw2@amd.com/

So this patch basically implements suggestion a), or at least my understanding of it. Is the original patch that does it via kvm_gmem_get_pfn() more in line with what you're suggesting here, or were you still thinking of an ioctl? Or some other place in the code? Keep in mind the GPA is needed to do the RMP updates, so that prevents us from hooking into the more obvious place like kvm_gmem_get_folio().

> Practically speaking, hooking the fault path will result in undesirable behavior. Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest shares a single 4KiB page in a 2MiB chunk. Dirty logging is another case where the RMP can stay at 2MiB. Or as a very silly example, imagine userspace pulls a stupid and covers a single 2MiB chunk of gmem with 512 memslots.

Unfortunately I don't think things are quite that flexible with SNP. If the RMP entry is 2MB, and you map a sub-page as 4K in the NPT, you'll immediately get a PFERR_GUEST_SIZEM on the first access (presumably when the guest tries to PVALIDATE it before use). The RMP fault handler will then subsequently need to PSMASH the 2MB entry into 4K before the guest can access it. So you get an extra page fault for every 2MB page that's mapped this way. (APM Volume 2, Section 15.36.10).

That might not be a big deal for guests that are at least somewhat optimized to make use of 2MB pages, but another situation is:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared but doesn't holepunch
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally allow for 2MB NPT mappings again, but this is actually not allowed because the RMP table mappings are validated/4K and cannot be promoted on the hypervisor, so the NPT mappings must still be limited to 4K to match this, otherwise we hit the reverse of the PFERR_GUEST_SIZEM scenario above, where the NPT mapping level is *larger* than the RMP entry level. Unfortunately that does not result in a PFERR_GUEST_SIZEM where we can fix things up in response, but instead it's a general RMP fault that would be tricky to distinguish from an RMP fault resulting from an implicit page conversion or some other guest weirdness without doing RMP table checks every time we get a general RMP fault.

So for all intents and purposes it does sort of end up being the case that the mapping size and RMP entry size are intertwined and can't totally be decoupled, and if you don't take that into account when updating the RMP entry, you'll have to do some extra PSMASHes in response to PFERR_GUEST_SIZEM RMP faults later.

> That likely means KVM will need an additional hook to clamp the max_level at the RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM should be able to blindly do the conversion because it would be a kernel bug if the page is already assigned to an ASID in the RMP, i.e. the additional hook shouldn't incur an extra RMP lookup.

Yah, we'd still need a hook in roughly this same place for clamping max_level. Previous versions of the SNP hypervisor patches all had a separate hook for handling these cases, but since the work of updating the RMP table prior to mapping isn't too dissimilar from the work of determining the max mapping size, I combined both of them into the kvm_x86_gmem_prepare() hook/implementation.

But I don't see any major issue with moving RMPUPDATE handling to an allocation-time hook. As mentioned above we'd get additional PFERR_GUEST_SIZEM faults by not taking the MMU mapping level into account, but I previously had it implemented similarly via a hook in kvm_gmem_get_pfn() (because we need the GPA) and didn't notice anything major. But I'm not sure exactly where you're suggesting we do it now, so could use some clarity on that if kvm_gmem_get_pfn() isn't what you had in mind.

WRT avoiding the RMP lookup, I'd tried __filemap_get_folio() without the FGP_CREAT flag as a way to check whether the folio was already allocated or not, but it seemed to result in a lockup, and it didn't look like it would be any better performance-wise than just doing the RMP lookup anyway (which is just a normal memory read), so that's the route I ended up going.

-Mike
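
(Illustration only, not code from the SNP series: the failure mode described above boils down to "the RMP entry is still 2MB while the guest and NPT are operating at 4K", which the host has to resolve with a PSMASH. rmp_entry_level() and psmash_2mb_entry() below are made-up stand-ins for the platform's RMP lookup and PSMASH primitives, and PFERR_GUEST_SIZEM_MASK refers to the error-code bit this thread calls PFERR_GUEST_SIZEM.)

/*
 * Hypothetical sketch of the size-mismatch handling for an #NPF whose
 * error code carries the RMP/SIZEM bits.  Placeholder prototypes:
 *   int rmp_entry_level(kvm_pfn_t pfn);
 *   int psmash_2mb_entry(kvm_pfn_t pfn_2mb_aligned);
 */
static void snp_handle_rmp_size_mismatch(kvm_pfn_t pfn, u64 error_code)
{
	if (!(error_code & PFERR_GUEST_SIZEM_MASK))
		return;

	/*
	 * The guest PVALIDATEd (or accessed) a 4K page whose RMP entry is
	 * still a 2MB entry: split the RMP entry so the 4K access can
	 * succeed, at the cost of this extra fault.
	 */
	if (rmp_entry_level(pfn) == PG_LEVEL_2M)
		psmash_2mb_entry(ALIGN_DOWN(pfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)));
}
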

Sorry for responding so late, lost track of this and only found it again when reviewing the next version :-/

On Fri, Jul 21, 2023, Michael Roth wrote:
> On Fri, Jul 21, 2023 at 07:25:58AM -0700, Sean Christopherson wrote:
> > Practically speaking, hooking the fault path will result in undesirable behavior. Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest shares a single 4KiB page in a 2MiB chunk. Dirty logging is another case where the RMP can stay at 2MiB. Or as a very silly example, imagine userspace pulls a stupid and covers a single 2MiB chunk of gmem with 512 memslots.
>
> Unfortunately I don't think things are quite that flexible with SNP. If the RMP entry is 2MB, and you map a sub-page as 4K in the NPT, you'll immediately get a PFERR_GUEST_SIZEM on the first access (presumably when the guest tries to PVALIDATE it before use). The RMP fault handler will then subsequently need to PSMASH the 2MB entry into 4K before the guest can access it. So you get an extra page fault for every 2MB page that's mapped this way. (APM Volume 2, Section 15.36.10).

Well that's just bloody stupid. Why is that a restriction? Obviously creating an NPT mapping that's larger would be annoying to handle, e.g. would require locking multiple entries or something, so I can understand why that's disallowed. But why can't software map at a finer granularity?

Note, I'm not expecting a spec change, just expressing a bit of disbelief.

Anyways, IMO, we should eat the extra #NPF in that scenario and optimize for the much, much more likely scenario of the RMP and NPT both being able to use 2MiB pages. And that means not inserting into the RMP when handling #NPFs, e.g. so that userspace can do fallocate() to prep the memory before mapping, and so that if the SPTEs get zapped for whatever reason, faulting them back in doesn't trigger useless RMP updates.

I guess the other way to optimize things would be for userspace to use the ioctl() to map memory into the guest[*]. But even then, initializing when allocating seems cleaner, especially for TDX.

[*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com

> That might not be a big deal for guests that are at least somewhat optimized to make use of 2MB pages, but another situation is:
>
> - gmem allocates 2MB page
> - guest issues PVALIDATE on 2MB page
> - guest later converts a subpage to shared but doesn't holepunch
> - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> - KVM MMU splits NPT mapping to 4K
> - guest converts that shared page back to private
>
> At this point there are no mixed attributes, and KVM would normally allow for 2MB NPT mappings again, but this is actually not allowed because the RMP table mappings are validated/4K and cannot be promoted on the hypervisor, so the NPT mappings must still be limited to 4K to match this, otherwise we hit the reverse of the PFERR_GUEST_SIZEM scenario above, where the NPT mapping level is *larger* than the RMP entry level. Unfortunately that does not result in a PFERR_GUEST_SIZEM where we can fix things up in response, but instead it's a general RMP fault that would be tricky to distinguish from an RMP fault resulting from an implicit page conversion or some other guest weirdness without doing RMP table checks every time we get a general RMP fault.

This seems like a bug in the SNP code. (a) why would KVM/SNP PSMASH in that scenario and (b) why can't it zap/split the NPT before PSMASH?

> So for all intents and purposes it does sort of end up being the case that the mapping size and RMP entry size are intertwined and can't totally be decoupled, and if you don't take that into account when updating the RMP entry, you'll have to do some extra PSMASHes in response to PFERR_GUEST_SIZEM RMP faults later.
>
> > That likely means KVM will need an additional hook to clamp the max_level at the RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM should be able to blindly do the conversion because it would be a kernel bug if the page is already assigned to an ASID in the RMP, i.e. the additional hook shouldn't incur an extra RMP lookup.
>
> Yah, we'd still need a hook in roughly this same place for clamping max_level. Previous versions of the SNP hypervisor patches all had a separate hook for handling these cases, but since the work of updating the RMP table prior to mapping isn't too dissimilar from the work of determining the max mapping size, I combined both of them into the kvm_x86_gmem_prepare() hook/implementation.
>
> But I don't see any major issue with moving RMPUPDATE handling to an allocation-time hook. As mentioned above we'd get additional PFERR_GUEST_SIZEM faults by not taking the MMU mapping level into account, but I previously had it implemented similarly via a hook in kvm_gmem_get_pfn() (because we need the GPA) and didn't notice anything major. But I'm not sure exactly where you're suggesting we do it now, so could use some clarity on that if kvm_gmem_get_pfn() isn't what you had in mind.

I was thinking kvm_gmem_get_folio(). If userspace doesn't preallocate, then kvm_gmem_get_pfn() will still eat the cost of the conversion, but it at least gives userspace the option and handles the zap case. Tracking which folios have been assigned (HKID or RMP) should be pretty straightforward.

I'm not totally opposed to doing more work at map time, e.g. to avoid faults or to play nice with PSMASH, but I would rather that not bleed into gmem. And I think/hope if we hook kvm_gmem_get_folio(), then TDX and SNP can use that as a common base, i.e. whatever extra shenanigans are needed for SNP can be carried in the SNP series. For me, having such "quirks" in the vendor specific series would be quite helpful as I'm having a hell of a time keeping track of all the wrinkles of TDX and SNP.
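
(Sketch of the allocation-time conversion being suggested here, purely illustrative: kvm_arch_gmem_prepare() is a hypothetical hook name, and the body is a simplified rendering of a guest_memfd folio allocation path, not the actual code from the series.)

/*
 * Hypothetical arch hook, e.g.:
 *   int kvm_arch_gmem_prepare(struct inode *inode, pgoff_t index, kvm_pfn_t pfn);
 */
static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
	struct folio *folio;

	folio = filemap_grab_folio(inode->i_mapping, index);
	if (IS_ERR_OR_NULL(folio))
		return folio;

	if (!folio_test_uptodate(folio)) {
		/*
		 * Freshly allocated: zero it and convert it to private
		 * (RMP/HKID) right here, so that zapping and re-faulting
		 * SPTEs never repeats the conversion.
		 */
		clear_highpage(folio_page(folio, 0));

		if (kvm_arch_gmem_prepare(inode, index, folio_pfn(folio))) {
			folio_unlock(folio);
			folio_put(folio);
			return NULL;
		}

		folio_mark_uptodate(folio);
	}

	return folio;
}
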
On Fri, Aug 18, 2023 at 03:27:47PM -0700, Sean Christopherson wrote: > Sorry for responding so late, lost track of this and only found it against when > reviewing the next version :-/ > > On Fri, Jul 21, 2023, Michael Roth wrote: > > On Fri, Jul 21, 2023 at 07:25:58AM -0700, Sean Christopherson wrote: > > > Practically speaking, hooking the fault path will result in undesirable behavior. > > > Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned > > > at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest > > > shares a single 4KiB page in a 2MiB chunk. Dirty logging is another case where > > > the RMP can stay at 2MiB. Or as a very silly example, imagine userspace pulls a > > > stupid and covers a single 2MiB chunk of gmem with 512 memslots. > > > > Unfortunately I don't think things aren't quite that flexible with SNP. If > > RMP entry is 2MB, and you map a sub-page as 4K in the NPT, you'll immediately > > get a PFERR_GUEST_SIZEM on the first access (presumably when the guest tries > > to PVALIDATE it before use). The RMP fault handler will then subsequently > > need to PSMASH the 2MB entry into 4K before that guest can access it. So you > > get an extra page fault for every 2MB page that's mapped this way. > > (APM Volume 2, Section 15.36.10). > > Well that's just bloody stupid. Why is that a restriction? Obviously creating > an NPT mapping that's larger would be annoying to handle, e.g. would require > locking multiple entries or something, so I can understand why that's disallowed. > But why can't software map at a finer granularity? > > Note, I'm expecting a spec change, just expressing a bit of disbelief. I asked David Kaplan about this and one of the reasons is that it allows hardware to reduce the number of RMP table accesses when doing table walks. E.g. if the mapping is 4K in the NPT, then only that RMP entry needs to be checked, the hardware doesn't need to also then check the 2MB-aligned RMP entry before determining whether to generate a #NPF. (note: the 'assigned' bit does get set on each individual 4K entry when RMPUPDATE use the 'page-size' bit to create a 2MB RMP entry, but if you look at the pseudo-code in the APM for RMPUPDATE it actually leaves ASID=0, so those accesses always generate an #NPF(RMP) and aren't actually usable for 4K NPT mappings of sub-pages for the 2MB range corresponding to that 2MB RMP entry (and not #NPF+RMP+SIZEM as I incorrectly stated above)) > > Anyways, IMO, we should eat the extra #NPF in that scenario and optimize for much, > much more likely scenario of the RMP and NPT both being able to use 2MiB pages. That seems fair, I only see this particularly case occur once or twice for boot. > And that means not inserting into the RMP when handling #NPFs, e.g. so that userspace > can do fallocate() to prep the memory before mapping, and so that if the SPTEs Makes sense. > get zapped for whatever reason, faulting them back in doesn't trigger useless > RMP updates. We wouldn't issue an RMPUPDATE if we saw that the page was already set to private in the RMP table. So what you actually save on is just accesses to the memory range containing the RMP entry to do that check, but the approach still makes sense. > > I guess the other way to optimze things would be for userspace to use the ioctl() > to map memory into the guest[*]. But even then, initializing when allocating > seems cleaner, especially for TDX. 
> > [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com > > > That might not be a big deal for guests that are at least somewhat optimized > > to make use of 2MB pages, but another situation is: > > > > - gmem allocates 2MB page > > - guest issues PVALIDATE on 2MB page > > - guest later converts a subpage to shared but doesn't holepunch > > - SNP host code issues PSMASH to split 2MB RMP mapping to 4K > > - KVM MMU splits NPT mapping to 4K > > - guest converts that shared page back to private > > > > At this point there are no mixed attributes, and KVM would normally > > allow for 2MB NPT mappings again, but this is actually not allowed > > because the RMP table mappings are validated/4K and cannot be promoted on > > the hypervisor, so the NPT mappings must still be limited to 4K to > > match this, otherwise we hit the reverse of the PFERR_GUEST_SIZEM > > scenario above, where the NPT mapping level is *larger* than the RMP > > entry level. Unfortunately that does not result in a PFERR_GUEST_SIZEM > > where we can fix things up in response, but instead it's a general > > RMP fault that would be tricky to distinguish from an RMP fault > > resulting from an implicit page conversion or some other guest weirdness > > without doing RMP table checks every time we get a general RMP fault. > > This seems like a bug in the SNP code. (a) why would KVM/SNP PSMASH in that > scenario and (b) why can't it zap/split the NPT before PSMASH? a) A PSMASH will be needed at some point, since, as detailed above, the 4K NPT mapping requires the RMP entries for the pages it maps to be limited to 4K RMP entries, but... b) What would happen normally[1] is the guest would issue PVALIDATE to *rescind* the validated status of that 4K GPA before issuing the GHCB request to convert it to shared. This would cause an #NPF(RMP+SIZE_MISMATCH) and handle_rmp_page_fault() would PSMASH the RMP entry so the PVALIDATE can succeed. So KVM doesn't even really have the option of deferring the PSMASH, it happens before the SET_MEMORY_ATTRIBUTES is even issued to zap the 2MB NPT mapping and switch the 2M range to 'mixed'. Because of that, we also need a hook in the KVM MMU code to clamp the max mapping level based on RMP entry size. Currently the kvm_gmem_prepare() in this patch doubles for handling that clamping, so we would still need a similar hook for that if we move the RMP initialization stuff to allocation time. [1] This handling is recommended for 'well-behaved' guests according to GHCB, but I don't see it documented as a hard requirement anywhere, so there is a possibility that that we have to deal with a guest that doesn't do this. What would happen then is the PVALIDATE wouldn't trigger the #NPF(RMP+SIZEM), and instead the SET_MEMORY_ATTRIBUTES would zap the 2MB mapping, install 4K entries on next #NPF(NOT_PRESENT), and at *that* point we would get an #NPF(RMP) without the SIZEM bit set, due to the behavior described in the beginning of this email. handle_rmp_page_fault() can do the corresponding PSMASH to deal with that, but it is a little unfortunate since we can't differentiate that case from a spurious/unexpected RMP faults, so would need to attempt a PSMASH in all cases, sometimes failing. gmem itself could also trigger this case if the lpage_info_slot() tracking ever became more granular than what the guest was expected (which I don't think would happen normally, but I may have hit one case where it does, but haven't had a chance to debug if that's on the lpage_info_slot() side or something else on the SNP end. 
> > > So for all intents and purposes it does sort of end up being the case > > that the mapping size and RMP entry size are sort of intertwined and > > can't totally be decoupled, and if you don't take that into account > > when updating the RMP entry, you'll have to do some extra PSMASH's > > in response to PFERR_GUEST_SIZEM RMP faults later. > > > > > > > > That likely means KVM will need an additional hook to clamp the max_level at the > > > RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM > > > should be able to blindly do the conversion because it would be a kernel bug if > > > the page is already assigned to an ASID in the RMP, i.e. the additional hook > > > shouldn't incur an extra RMP lookup. > > > > Yah we'd still need a hook in roughly this same place for clamping > > max_level. Previous versions of SNP hypervisor patches all had a separate > > hook for handling these cases, but since the work of updating the RMP table > > prior to mapping isn't too dissimilar from the work of determining max > > mapping size, I combined both of them into the kvm_x86_gmem_prepare() > > hook/implementation. > > > > But I don't see any major issue with moving RMPUPDATE handling to an > > allocation-time hook. As mentioned above we'd get additional > > PFERR_GUEST_SIZEM faults by not taking MMU mapping level into account, but > > I previously had it implemented similarly via a hook in kvm_gmem_get_pfn() > > (because we need the GPA) and didn't notice anything major. But I'm not > > sure exactly where you're suggesting we do it now, so could use some > > clarify on that if kvm_gmem_get_pfn() isn't what you had in mind. > > I was thinking kvm_gmem_get_folio(). If userspace doesn't preallocate, then > kvm_gmem_get_pfn() will still east the cost of the conversion, but it at least > gives userspace the option and handles the zap case. Tracking which folios have > been assigned (HKID or RMP) should be pretty straightforward. This is a little tricky. kvm_gmem_get_pfn() can handle RMP entry initialization, no problem, because it has the GPA we are fetching the PFN for. But kvm_gmem_get_folio() doesn't. But as long as the preallocation happens after kvm_gmem_bind() and before kvm_gmem_unbind() we should be able to use the slot to figure out what GPA a particular FD offset corresponds to. I'll look into this more. > > I'm not totally opposed to doing more work at map time, e.g. to avoid faults or > to play nice with PSMASH, but I would rather that not bleed into gmem. And I > think/hope if we hook kvm_gmem_get_folio(), then TDX and SNP can use that as a > common base, i.e. whatever extra shenanigans are needed for SNP can be carried > in the SNP series. For me, having such "quirks" in the vendor specific series > would be quite helpful as I'm having a hell of a time keeping track of all the > wrinkles of TDX and SNP. Seems reasonable. I'll work on re-working the hooks using this approach. Thanks, Mike
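
(For the "use the slot to figure out what GPA a particular FD offset corresponds to" idea above, a hypothetical helper might look like the following; it simply inverts the gfn-to-index calculation used when fetching a PFN, and is only meaningful while the kvm_gmem_bind() association is in place.)

/* Illustrative only: translate a gmem file offset (folio index) back to a gfn. */
static gfn_t kvm_gmem_index_to_gfn(struct kvm_memory_slot *slot, pgoff_t index)
{
	return slot->base_gfn + index - slot->gmem.pgoff;
}
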

On Fri, Aug 25, 2023 at 07:59:41PM -0500, Michael Roth wrote:
> On Fri, Aug 18, 2023 at 03:27:47PM -0700, Sean Christopherson wrote:
> > This seems like a bug in the SNP code. (a) why would KVM/SNP PSMASH in that scenario and (b) why can't it zap/split the NPT before PSMASH?
>
> a) A PSMASH will be needed at some point, since, as detailed above, the 4K NPT mapping requires the RMP entries for the pages it maps to be limited to 4K RMP entries, but...
>
> b) What would happen normally[1] is the guest would issue PVALIDATE to *rescind* the validated status of that 4K GPA before issuing the GHCB request to convert it to shared. This would cause an #NPF(RMP+SIZE_MISMATCH) and handle_rmp_page_fault() would PSMASH the RMP entry so the PVALIDATE can succeed.
>
> So KVM doesn't even really have the option of deferring the PSMASH, it happens before the SET_MEMORY_ATTRIBUTES is even issued to zap the 2MB NPT mapping and switch the 2M range to 'mixed'. Because of that, we also need a hook in the KVM MMU code to clamp the max mapping level based on RMP entry size. Currently the kvm_gmem_prepare() in this patch doubles for handling that clamping, so we would still need a similar hook for that if we move the RMP initialization stuff to allocation time.
>
> [1] This handling is recommended for 'well-behaved' guests according to the GHCB spec, but I don't see it documented as a hard requirement anywhere, so there is a possibility that we have to deal with a guest that doesn't do this. What would happen then is the PVALIDATE wouldn't trigger the #NPF(RMP+SIZEM), and instead the SET_MEMORY_ATTRIBUTES would zap the 2MB mapping, install 4K entries on the next #NPF(NOT_PRESENT), and at *that* point we would get an #NPF(RMP) without the SIZEM bit set, due to the behavior described at the beginning of this email.
>
> handle_rmp_page_fault() can do the corresponding PSMASH to deal with that, but it is a little unfortunate since we can't differentiate that case from spurious/unexpected RMP faults, so we would need to attempt a PSMASH in all cases, sometimes failing.

The spurious case here is when the guest is accessing a private page that's just been PSMASH'd by another thread. I thought these might still occur before the PSMASH has completed, so we'd still potentially see the page-size bit set in the RMP entry, but the RMP faults only happen after the PSMASH has finished, so the spurious cases can be filtered out by just checking if the page-size bit is set before attempting a PSMASH.

> gmem itself could also trigger this case if the lpage_info_slot() tracking ever became more granular than what the guest expected (which I don't think would happen normally, but I may have hit one case where it does, and haven't had a chance to debug whether that's on the lpage_info_slot() side or something else on the SNP end).

Turns out that was for a case where the shared/private attributes for the 2M range really were mixed at the time of access. In this case it is when OVMF converts some shared memory to private during early boot: the end address is not 2MB-aligned, so it is in a mixed region, and the NPT mapping is 4K, but the RMP entry is initialized as 2MB. In this case the PVALIDATE and NPT agree on the 4K mapping size, so the SIZEM bit isn't set, just the #NPF(RMP). So we need to be able to deal with that even for 'well-behaved' guests.

With the RMP-init-during-mapping-time approach I had some checks that avoided creating the 2MB RMP entry in this mixed case, which is why I didn't need handling for this previously. But it's just one extra #NPF(RMP), and it can be handled cleanly since it can be distinguished from spurious cases.

-Mike
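
(Again illustrative rather than series code: the "check the page-size bit before attempting a PSMASH" filtering described above could look roughly like this, reusing the hypothetical RMP/PSMASH placeholders from the earlier sketch.)

/*
 * Hypothetical handling of an #NPF(RMP) without SIZEM: if the RMP entry is
 * no longer a 2MB entry, another thread already PSMASHed it and the fault
 * is spurious, so just let the vCPU retry the access.
 */
static int snp_handle_rmp_fault_no_sizem(kvm_pfn_t pfn)
{
	if (rmp_entry_level(pfn) != PG_LEVEL_2M)	/* placeholder RMP lookup */
		return 0;				/* spurious, retry the access */

	return psmash_2mb_entry(ALIGN_DOWN(pfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)));
}
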

On Tue, Aug 29, 2023, Michael Roth wrote:
> So we need to be able to deal with that even for 'well-behaved' guests.
>
> With the RMP-init-during-mapping-time approach I had some checks that avoided creating the 2MB RMP entry in this mixed case, which is why I didn't need handling for this previously. But it's just one extra #NPF(RMP), and it can be handled cleanly since it can be distinguished from spurious cases.

Just to make sure you're not waiting on me for something, the TL;DR is that my suggestion is viable and not too horrific?