[PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit

Posted by Yosry Ahmed 2 weeks, 6 days ago
nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
flush the TLB if VPID is enabled for both L1 and L2, but they still
share the TLB tag. This happens if EPT is disabled and KVM fails to
allocate a VPID for L2, so both the EPTP and VPID are shared between L1
and L2.

Interestingly, nested_vmx_transition_tlb_flush() uses
KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
flush is required.

Taking a close look at vmx_flush_tlb_guest() and
vmx_flush_tlb_current(), the main differences are:
(a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
(b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
INVVPID) to flush the guest-physical mappings as well as combined
mappings.

The check in (a) is seemingly an optimization, and there should not be
any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
check in vmx_flush_tlb_guest() is not a fundamental difference, and it
can be added there separately if needed.

The difference in (b) is irrelevant in this case, because EPT being
enabled for L1 means that its TLB tags are tagged with EPTP and cannot
be used by L2 (regardless of whether or not EPT is enabled for L2).
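
For reference, the two callbacks look roughly like this (abridged sketch of
arch/x86/kvm/vmx/vmx.c; helper and field names vary a bit across kernel
versions, so treat this as illustrative rather than verbatim):

  static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
  {
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	u64 root_hpa = mmu->root.hpa;

	/* (a) No flush required if the current KVM MMU is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	/*
	 * (b) INVEPT when EPT is enabled, which flushes guest-physical and
	 * combined mappings as well; INVVPID otherwise.
	 */
	if (enable_ept)
		ept_sync_context(construct_eptp(vcpu, root_hpa,
						mmu->root_role.level));
	else
		vpid_sync_context(vmx_get_current_vpid(vcpu));
  }

  static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
  {
	/* INVVPID flushes only linear and combined mappings. */
	vpid_sync_context(vmx_get_current_vpid(vcpu));
  }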

Use KVM_REQ_TLB_FLUSH_GUEST in this case in
nested_vmx_transition_tlb_flush() for consistency. This arguably makes
more sense conceptually too -- L1 and L2 cannot share the TLB tag for
guest-physical translations, so only flushing linear and combined
translations (i.e. guest-generated translations) is needed.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---

I tested this by running all selftests that have "nested" in their name
(and not svm). I was tempted to run KVM-unit-tests in an L1 guest but I
convinced myself it's not warranted by the change :)

---
 arch/x86/kvm/vmx/nested.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index aa78b6f38dfef..2ed454186e59c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1241,7 +1241,7 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
 	 * as the effective ASID is common to both L1 and L2.
 	 */
 	if (!nested_has_guest_tlb_tag(vcpu))
-		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 
 static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
-- 
2.48.0.rc2.279.g1de40edade-goog
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Jim Mattson 2 weeks, 6 days ago
On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
> flush the TLB if VPID is enabled for both L1 and L2, but they still
> share the TLB tag. This happens if EPT is disabled and KVM fails to
> allocate a VPID for L2, so both the EPTP and VPID are shared between L1
> and L2.

Nit: Combined and guest-physical TLB tags are based on EPTRTA (the new
acronym for EP4TA), not EPTP. But, in any case, with EPT disabled,
there are no combined or guest-physical mappings. There are only
linear mappings.

> Interestingly, nested_vmx_transition_tlb_flush() uses
> KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
> flush is required.
>
> Taking a close look at vmx_flush_tlb_guest() and
> vmx_flush_tlb_current(), the main differences are:
> (a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
> (b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
> INVVPID) to flush the guest-physical mappings as well as combined
> mappings.
>
> The check in (a) is seemingly an optimization, and there should not be
> any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
> check in vmx_flush_tlb_guest() is not a fundamental difference, and it
> can be added there separately if needed.
>
> The difference in (b) is irrelevant in this case, because EPT being
> enabled for L1 means that its TLB tags are tagged with EPTP and cannot
> be used by L2 (regardless of whether or not EPT is enabled for L2).

The difference is also irrelevant because, as you concluded in the
first paragraph, EPT is disabled in the final block of
nested_vmx_transition_tlb_flush().

> Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> guest-physical translations, so only flushing linear and combined
> translations (i.e. guest-generated translations) is needed.

And, as I mentioned above, with EPT disabled, there are no combined or
guest-physical mappings.

> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

I think the reasoning in the commit message can be cleared up a bit, but...

Reviewed-by: Jim Mattson <jmattson@google.com>

> ---
>
> I tested this by running all selftests that have "nested" in their name
> (and not svm). I was tempted to run KVM-unit-tests in an L1 guest but I
> convinced myself it's not warranted by the change :)
>
> ---
>  arch/x86/kvm/vmx/nested.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index aa78b6f38dfef..2ed454186e59c 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1241,7 +1241,7 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>          * as the effective ASID is common to both L1 and L2.
>          */
>         if (!nested_has_guest_tlb_tag(vcpu))
> -               kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> +               kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  }
>
>  static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
> --
> 2.48.0.rc2.279.g1de40edade-goog
>
>
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Yosry Ahmed 2 weeks, 5 days ago
On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
> > flush the TLB if VPID is enabled for both L1 and L2, but they still
> > share the TLB tag. This happens if EPT is disabled and KVM fails to
> > allocate a VPID for L2, so both the EPTP and VPID are shared between L1
> > and L2.
>
> Nit: Combined and guest-physical TLB tags are based on EPTRTA (the new
> acronym for EP4TA), not EPTP. But, in any case, with EPT disabled,
> there are no combined or guest-physical mappings. There are only
> linear mappings.

Interestingly, I did initially write EPTRTA, but I changed it to EPTP
because that is the terminology used in nested_has_guest_tlb_tag().
Anyway, I definitely don't mind changing it to EPTRTA.
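
For reference, the helper in question is roughly the following (abridged
sketch, not the verbatim source):

  static bool nested_has_guest_tlb_tag(struct kvm_vcpu *vcpu)
  {
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/*
	 * L2 has its own TLB tag if KVM uses EPT (TLB entries are tagged
	 * with the EPT root) or if KVM allocated a dedicated VPID for L2.
	 */
	return enable_ept ||
	       (nested_cpu_has_vpid(get_vmcs12(vcpu)) && vmx->nested.vpid02);
  }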

>
> > Interestingly, nested_vmx_transition_tlb_flush() uses
> > KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
> > flush is required.
> >
> > Taking a close look at vmx_flush_tlb_guest() and
> > vmx_flush_tlb_current(), the main differences are:
> > (a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
> > (b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
> > INVVPID) to flush the guest-physical mappings as well as combined
> > mappings.
> >
> > The check in (a) is seemingly an optimization, and there should not be
> > any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
> > check in vmx_flush_tlb_guest() is not a fundamental difference, and it
> > can be added there separately if needed.
> >
> > The difference in (b) is irrelevant in this case, because EPT being
> > enabled for L1 means that its TLB tags are tagged with EPTP and cannot
> > be used by L2 (regardless of whether or not EPT is enabled for L2).
>
> The difference is also irrelevant because, as you concluded in the
> first paragraph, EPT is disabled in the final block of
> nested_vmx_transition_tlb_flush().

I was trying to explain that even if EPT is enabled, sharing
guest-physical translations between L1 and L2 should never be possible
(and hence we should never worry about flushing these translations in
nested_vmx_transition_tlb_flush()).  Now that I read it again it is
not as clear as I had hoped.

>
> > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > guest-physical translations, so only flushing linear and combined
> > translations (i.e. guest-generated translations) is needed.
>
> And, as I mentioned above, with EPT disabled, there are no combined or
> guest-physical mappings.
>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> I think the reasoning in the commit message can be cleared up a bit, but...

Agreed :) I am sure Sean will also want changes in the commit message anyway.

>
> Reviewed-by: Jim Mattson <jmattson@google.com>

Thanks for the quick review!
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Sean Christopherson 2 weeks, 5 days ago
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > guest-physical translations, so only flushing linear and combined
> > > translations (i.e. guest-generated translations) is needed.

No, using KVM_REQ_TLB_FLUSH_CURRENT is correct.  From *L1's* perspective, VPID
is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
TLBs, and thus KVM is not required to FLUSH_GUEST.

E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
guaranteed to be visible to L2.

But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
hardware TLBs are flushed.  Without EPT, KVM will use different CR3s for L1 and
L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.

Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
is not required in this scenario.

  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
  {
	++vcpu->stat.tlb_flush;

	if (!tdp_enabled) {
		/*
		 * A TLB flush on behalf of the guest is equivalent to
		 * INVPCID(all), toggling CR4.PGE, etc., which requires
		 * a forced sync of the shadow page tables.  Ensure all the
		 * roots are synced and the guest TLB in hardware is clean.
		 */
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_sync_prev_roots(vcpu);
	}

	kvm_x86_call(flush_tlb_guest)(vcpu);

	/*
	 * Flushing all "guest" TLB is always a superset of Hyper-V's fine
	 * grained flushing.
	 */
	kvm_hv_vcpu_purge_flush_tlb(vcpu);
  }
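
For contrast, kvm_vcpu_flush_tlb_current() is a thin wrapper with no such
sync (abridged):

  static void kvm_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu)
  {
	++vcpu->stat.tlb_flush;
	kvm_x86_call(flush_tlb_current)(vcpu);
  }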
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Yosry Ahmed 2 weeks, 5 days ago
On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > guest-physical translations, so only flushing linear and combined
> > > > translations (i.e. guest-generated translations) is needed.
>
> No, using KVM_REQ_TLB_FLUSH_CURRENT is correct.  From *L1's* perspective, VPID
> is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> TLBs, and thus KVM is not required to FLUSH_GUEST.
>
> E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> guaranteed to be visible to L2.
>
> But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> hardware TLBs are flushed.  Without EPT, KVM will use different CR3s for L1 and
> L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
>
> Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> is not required in this scenario.

Aha, I was examining vmx_flush_tlb_guest() not
kvm_vcpu_flush_tlb_guest(), so I missed the synchronization. Yeah I
think it's possible that we end up unnecessarily synchronizing the
shadow page tables (or dropping them) in this case.

Do you think it's worth expanding the comment in
nested_vmx_transition_tlb_flush()?

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2ed454186e59c..43d34e413d016 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
         * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
         * KVM was unable to allocate a VPID for L2, flush the current context
         * as the effective ASID is common to both L1 and L2.
+        *
+        * Note that even though TLB_FLUSH_GUEST would be correct because we
+        * only need to flush linear mappings, it would unnecessarily
+        * synchronize the MMU even though a TLB flush is not architecturally
+        * required from L1's perspective.
         */
        if (!nested_has_guest_tlb_tag(vcpu))
                kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Sean Christopherson 2 weeks, 5 days ago
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > > guest-physical translations, so only flushing linear and combined
> > > > > translations (i.e. guest-generated translations) is needed.
> >
> > No, using KVM_REQ_TLB_FLUSH_CURRENT is correct.  From *L1's* perspective, VPID
> > is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> > TLBs, and thus KVM is not required to FLUSH_GUEST.
> >
> > E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> > PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> > to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> > guaranteed to be visible to L2.
> >
> > But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> > hardware TLBs are flushed.  Without EPT, KVM will use different CR3s for L1 and
> > L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> > KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
> >
> > Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> > is not required in this scenario.
> 
> Aha, I was examining vmx_flush_tlb_guest() not
> kvm_vcpu_flush_tlb_guest(), so I missed the synchronization. Yeah I
> think it's possible that we end up unnecessarily synchronizing the
> shadow page tables (or dropping them) in this case.
> 
> Do you think it's worth expanding the comment in
> nested_vmx_transition_tlb_flush()?
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2ed454186e59c..43d34e413d016 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>          * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
>          * KVM was unable to allocate a VPID for L2, flush the current context
>          * as the effective ASID is common to both L1 and L2.
> +        *
> +        * Note that even though TLB_FLUSH_GUEST would be correct because we
> +        * only need to flush linear mappings, it would unnecessarily
> +        * synchronize the MMU even though a TLB flush is not architecturally
> +        * required from L1's perspective.

I'm open to calling out that there's no flush from L1's perspective, but this
is inaccurate.  Using TLB_FLUSH_GUEST is simply not correct.  Will it cause
functional problems?  No.  But neither would blasting kvm_flush_remote_tlbs(),
and I think most people would consider flushing all TLBs on all vCPUs to be a
bug.

How about:

	 * Note, only the hardware TLB entries need to be flushed, as VPID is
	 * fully enabled from L1's perspective, i.e. there's no architectural
	 * TLB flush from L1's perspective.

>          */
>         if (!nested_has_guest_tlb_tag(vcpu))
>                 kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Yosry Ahmed 2 weeks, 5 days ago
On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > > > guest-physical translations, so only flushing linear and combined
> > > > > > translations (i.e. guest-generated translations) is needed.
> > >
> > > No, using KVM_REQ_TLB_FLUSH_CURRENT is correct.  From *L1's* perspective, VPID
> > > is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> > > TLBs, and thus KVM is not required to FLUSH_GUEST.
> > >
> > > E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> > > PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> > > to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> > > guaranteed to be visible to L2.
> > >
> > > But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> > > hardware TLBs are flushed.  Without EPT, KVM will use different CR3s for L1 and
> > > L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> > > KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
> > >
> > > Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> > > is not required in this scenario.
> >
> > Aha, I was examining vmx_flush_tlb_guest() not
> > kvm_vcpu_flush_tlb_guest(), so I missed the synchronization. Yeah I
> > think it's possible that we end up unnecessarily synchronizing the
> > shadow page tables (or dropping them) in this case.
> >
> > Do you think it's worth expanding the comment in
> > nested_vmx_transition_tlb_flush()?
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 2ed454186e59c..43d34e413d016 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
> >          * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
> >          * KVM was unable to allocate a VPID for L2, flush the current context
> >          * as the effective ASID is common to both L1 and L2.
> > +        *
> > +        * Note that even though TLB_FLUSH_GUEST would be correct because we
> > +        * only need to flush linear mappings, it would unnecessarily
> > +        * synchronize the MMU even though a TLB flush is not architecturally
> > +        * required from L1's perspective.
>
> I'm open to calling out that there's no flush from L1's perspective, but this
> is inaccurate.  Using TLB_FLUSH_GUEST is simply not correct.  Will it cause
> functional problems?  No.  But neither would blasting kvm_flush_remote_tlbs(),
> and I think most people would consider flushing all TLBs on all vCPUs to be a
> bug.

Yeah, I meant functionally correct in the sense that it does not cause
correctness issues, but it is definitely a problem.

>
> How about:
>
>          * Note, only the hardware TLB entries need to be flushed, as VPID is
>          * fully enabled from L1's perspective, i.e. there's no architectural
>          * TLB flush from L1's perspective.

I hate to bikeshed, but I want to explicitly call out that we do not
need to synchronize the MMU. Maybe this?

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2ed454186e59c..a9171909de63d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
         * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
         * KVM was unable to allocate a VPID for L2, flush the current context
         * as the effective ASID is common to both L1 and L2.
+        *
+        * Note, only the hardware TLB entries need to be flushed, as VPID is
+        * fully enabled from L1's perspective, i.e. there's no
+        * architectural TLB flush from L1's perspective. Hence, synchronizing
+        * the MMU is not required as the mappings are still valid.
         */
        if (!nested_has_guest_tlb_tag(vcpu))
                kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Sean Christopherson 2 weeks, 5 days ago
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > How about:
> >
> >          * Note, only the hardware TLB entries need to be flushed, as VPID is
> >          * fully enabled from L1's perspective, i.e. there's no architectural
> >          * TLB flush from L1's perspective.
> 
> I hate to bikeshed, but I want to explicitly call out that we do not
> need to synchronize the MMU.

Why?  Honest question, I want to understand what's unclear.  My hesitation to
talk about synchronizing MMUs is that it brings things into play that aren't
super relevant to this specific code, and might even add confusion.  Specifically,
kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
the fact that this path is reachable if and only if EPT is enabled is completely
coincidental.

E.g. very hypothetically, if KVM used the same EPT root (I already forgot Intel's
new acronym) for L1 and L2, then this would no longer be true:

 * If L0 uses EPT, L1 and L2 run with different EPTP because
 * guest_mode is part of kvm_mmu_page_role. Thus, TLB entries
 * are tagged with different EPTP.

L1 vs. L2 EPT usage would no longer use separate ASID tags, and so KVM would
need to FLUSH_CURRENT on transitions in most cases, e.g. to purge APICv mappings.

The comment above !nested_cpu_has_vpid() talks at length about synchronizing MMUs
because the EPT behavior in particular is subtle and rather unintuitive.  I.e.
the comment is much more about NOT using KVM_REQ_MMU_SYNC than it is about using
KVM_REQ_TLB_FLUSH_GUEST.

> Maybe this?
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2ed454186e59c..a9171909de63d 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>          * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
>          * KVM was unable to allocate a VPID for L2, flush the current context
>          * as the effective ASID is common to both L1 and L2.
> +        *
> +        * Note, only the hardware TLB entries need to be flushed, as VPID is
> +        * fully enabled from L1's perspective, i.e. there's no
> +        * architectural TLB flush from L1's perspective. Hence, synchronizing
> +        * the MMU is not required as the mappings are still valid.
>          */
>         if (!nested_has_guest_tlb_tag(vcpu))
>                 kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Yosry Ahmed 2 weeks, 5 days ago
On Thu, Jan 16, 2025 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > How about:
> > >
> > >          * Note, only the hardware TLB entries need to be flushed, as VPID is
> > >          * fully enabled from L1's perspective, i.e. there's no architectural
> > >          * TLB flush from L1's perspective.
> >
> > I hate to bikeshed, but I want to explicitly call out that we do not
> > need to synchronize the MMU.
>
> Why?  Honest question, I want to understand what's unclear.  My hesitation to
> talk about synchronizing MMUs is that it brings things into play that aren't
> super relevant to this specific code, and might even add confusion.  Specifically,
> kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
> the fact that this path is reachable if and only if EPT is enabled is completely
> coincidental.

Personally, the main thing that was unclear to me and I wanted to add
a comment to clarify was why we use KVM_REQ_TLB_FLUSH_GUEST in the
first two cases but KVM_REQ_TLB_FLUSH_CURRENT in the last one.

Here's my understanding:

In the first case (i.e. !nested_cpu_has_vpid(vmcs12)), the flush is
architecturally required from L1's perspective, so we need to flush
guest-generated TLB entries (and potentially synchronize KVM's MMU).

In the second case, KVM does not track the history of VPID12, so the
flush *may* be architecturally required from L1's perspective, so we
do the same thing.

In the last case though, the flush is NOT architecturally required
from L1's perspective, it's just an artifact of KVM's potential
failure to allocate a dedicated VPID for L2 despite L1 asking for it.
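
Mapping those three cases onto the code, the function is structured roughly
as follows (abridged sketch; the comments and Hyper-V handling are elided,
so this is not the verbatim source):

	if (!nested_cpu_has_vpid(vmcs12)) {
		/* Case 1: flush architecturally required from L1's view. */
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
	} else if (is_vmenter &&
		   vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
		/* Case 2: VPID12 changed, a flush *may* be required. */
		vmx->nested.last_vpid = vmcs12->virtual_processor_id;
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
	}

	/*
	 * Case 3: no architectural flush, but L1 and L2 share the hardware
	 * ASID, so the hardware TLB for the current context must be flushed.
	 */
	if (!nested_has_guest_tlb_tag(vcpu))
		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);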

So ultimately, I don't want to specifically call out synchronizing
MMUs, as much as I want to call out why this case uses
KVM_REQ_TLB_FLUSH_CURRENT and not KVM_REQ_TLB_FLUSH_GUEST like the
others. I only suggested calling out the MMU synchronization since
it's effectively the only difference between the two in this case.

I am open to any wording you think is best. I am also fine with just
dropping this completely, definitely not the hill to die on :)

>
> E.g. very hypothetically, if KVM used the same EPT root (I already forgot Intel's
> new acronym) for L1 and L2, then this would no longer be true:
>
>  * If L0 uses EPT, L1 and L2 run with different EPTP because
>  * guest_mode is part of kvm_mmu_page_role. Thus, TLB entries
>  * are tagged with different EPTP.
>
> L1 vs. L2 EPT usage would no longer use separate ASID tags, and so KVM would
> need to FLUSH_CURRENT on transitions in most cases, e.g. to purge APICv mappings.
>
> The comment above !nested_cpu_has_vpid() talks at length about synchronizing MMUs
> because the EPT behavior in particular is subtle and rather unintuitive.  I.e.
> the comment is much more about NOT using KVM_REQ_MMU_SYNC than it is about using
> KVM_REQ_TLB_FLUSH_GUEST.
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Sean Christopherson 2 weeks, 4 days ago
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Thu, Jan 16, 2025 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > How about:
> > > >
> > > >          * Note, only the hardware TLB entries need to be flushed, as VPID is
> > > >          * fully enabled from L1's perspective, i.e. there's no architectural
> > > >          * TLB flush from L1's perspective.
> > >
> > > I hate to bikeshed, but I want to explicitly call out that we do not
> > > need to synchronize the MMU.
> >
> > Why?  Honest question, I want to understand what's unclear.  My hesitation to
> > talk about synchronizing MMUs is that it brings things into play that aren't
> > super relevant to this specific code, and might even add confusion.  Specifically,
> > kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
> > the fact that this path is reachable if and only if EPT is enabled is completely
> > coincidental.
> 
> Personally, the main thing that was unclear to me and I wanted to add
> a comment to clarify was why we use KVM_REQ_TLB_FLUSH_GUEST in the
> first two cases but KVM_REQ_TLB_FLUSH_CURRENT in the last one.
>
> Here's my understanding:
> 
> In the first case (i.e. !nested_cpu_has_vpid(vmcs12)), the flush is
> architecturally required from L1's perspective, so we need to flush
> guest-generated TLB entries (and potentially synchronize KVM's MMU).
> 
> In the second case, KVM does not track the history of VPID12, so the
> flush *may* be architecturally required from L1's perspective, so we
> do the same thing.
> 
> In the last case though, the flush is NOT architecturally required
> from L1's perspective, it's just an artifact of KVM's potential
> failure to allocate a dedicated VPID for L2 despite L1 asking for it.
> 
> So ultimately, I don't want to specifically call out synchronizing
> MMUs, as much as I want to call out why this case uses
> KVM_REQ_TLB_FLUSH_CURRENT and not KVM_REQ_TLB_FLUSH_GUEST like the
> others. I only suggested calling out the MMU synchronization since
> it's effectively the only difference between the two in this case.

Yep.  I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
TLB_FLUSH_CURRENT.  I'm not entirely sure where it would be best to document
them.  I guess maybe where they are #defined?

TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
perspective, is architecturally required.  The one oddity with TLB_FLUSH_GUEST
is that it does NOT include guest-physical mappings, i.e. TLB entries that are
associated with an EPT root.

TLB_FLUSH_CURRENT is used when KVM needs to flush the hardware TLB(s) for the
current CPU+context.  The most "obvious" case is for when KVM has modified its
page tables.  More subtle cases are things like changing which physical CPU the
vCPU is running on, and this case where KVM is switching the shadow CR3, VPID is
enabled in the host (i.e. hardware won't flush TLBs), and the L1 vs. L2 shadow
CR3s are not tagged (i.e. use the same hardware VPID).

The use of TLB_FLUSH_GUEST *may* result in an MMU sync, but that's a side effect
of an architectural guest TLB flush occurring, not the other way around.
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Yosry Ahmed 2 weeks, 4 days ago
On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Thu, Jan 16, 2025 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > > On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > How about:
> > > > >
> > > > >          * Note, only the hardware TLB entries need to be flushed, as VPID is
> > > > >          * fully enabled from L1's perspective, i.e. there's no architectural
> > > > >          * TLB flush from L1's perspective.
> > > >
> > > > I hate to bikeshed, but I want to explicitly call out that we do not
> > > > need to synchronize the MMU.
> > >
> > > Why?  Honest question, I want to understand what's unclear.  My hesitation to
> > > talk about synchronizing MMUs is that it brings things into play that aren't
> > > super relevant to this specific code, and might even add confusion.  Specifically,
> > > kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
> > > the fact that this path is reachable if and only if EPT is enabled is completely
> > > coincidental.
> >
> > Personally, the main thing that was unclear to me and I wanted to add
> > a comment to clarify was why we use KVM_REQ_TLB_FLUSH_GUEST in the
> > first two cases but KVM_REQ_TLB_FLUSH_CURRENT in the last one.
> >
> > Here's my understanding:
> >
> > In the first case (i.e. !nested_cpu_has_vpid(vmcs12)), the flush is
> > architecturally required from L1's perspective, so we need to flush
> > guest-generated TLB entries (and potentially synchronize KVM's MMU).
> >
> > In the second case, KVM does not track the history of VPID12, so the
> > flush *may* be architecturally required from L1's perspective, so we
> > do the same thing.
> >
> > In the last case though, the flush is NOT architecturally required
> > from L1's perspective, it's just an artifact of KVM's potential
> > failure to allocate a dedicated VPID for L2 despite L1 asking for it.
> >
> > So ultimately, I don't want to specifically call out synchronizing
> > MMUs, as much as I want to call out why this case uses
> > KVM_REQ_TLB_FLUSH_CURRENT and not KVM_REQ_TLB_FLUSH_GUEST like the
> > others. I only suggested calling out the MMU synchronization since
> > it's effectively the only difference between the two in this case.
>
> Yep.  I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> TLB_FLUSH_CURRENT.  I'm not entirely sure where it would be best to document
> them.  I guess maybe where they are #defined?

I guess at the #define we can just mention that they result in calling
kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
anything.
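
Both requests are serviced from the same spot shortly before entering the
guest, roughly (abridged from arch/x86/kvm/x86.c):

  void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu)
  {
	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
		kvm_vcpu_flush_tlb_current(vcpu);

	if (kvm_check_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu))
		kvm_vcpu_flush_tlb_guest(vcpu);
  }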

The specific documentation about what they do could be above the
functions themselves, and describing the potential MMU sync is
naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
already there).

The flush_tlb_guest() callback is documented in kvm_host.h, but not
flush_tlb_current(). I was going to suggest just documenting that. But
kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
it also potentially synchronizes the MMU. So only documenting the
callbacks does not paint a full picture.

FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
functions are more-or-less thin wrappers around the per-vendor
callbacks -- except kvm_vcpu_flush_tlb_guest().

>
> TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> perspective, is architecturally required.  The one oddity with TLB_FLUSH_GUEST
> is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> associated with an EPT root.

The way I think about this is how it's documented above the per-vendor
callback. It flushes translations created by the guest. The guest does
not (directly) create guest-physical translations, only linear and
combined translations.

>
> TLB_FLUSH_CURRENT is used when KVM needs to flush the hardware TLB(s) for the
> current CPU+context.  The most "obvious" case is for when KVM has modified its
> page tables.  More subtle cases are things like changing which physical CPU the
> vCPU is running on, and this case where KVM is switching the shadow CR3, VPID is
> enabled in the host (i.e. hardware won't flush TLBs), and the L1 vs. L2 shadow
> CR3s are not tagged (i.e. use the same hardware VPID).
>
> The use of TLB_FLUSH_GUEST *may* result in an MMU sync, but that's a side effect
> of an architectural guest TLB flush occurring, not the other way around.

Agreed.
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Sean Christopherson 2 weeks, 4 days ago
On Fri, Jan 17, 2025, Yosry Ahmed wrote:
> On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
> > Yep.  I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> > TLB_FLUSH_CURRENT.  I'm not entirely sure where it would be best to document
> > them.  I guess maybe where they are #defined?
> 
> I guess at the #define we can just mention that they result in calling
> kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
> anything.

Yeah, a "See xx for details" redirect is probably the best option.

> The specific documentation about what they do could be above the
> functions themselves, and describing the potential MMU sync is
> naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
> already there).
> 
> The flush_tlb_guest() callback is documented in kvm_host.h, but not
> flush_tlb_current(). I was going to suggest just documenting that. But
> kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
> it also potentially synchronizes the MMU. So only documenting the
> callbacks does not paint a full picture.
> 
> FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
> functions are more-or-less thin wrappers around the per-vendor
> callbacks -- except kvm_vcpu_flush_tlb_guest().
> 
> >
> > TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> > perspective, is architecturally required.  The one oddity with TLB_FLUSH_GUEST
> > is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> > associated with an EPT root.
> 
> The way I think about this is how it's documented above the per-vendor
> callback. It flushes translations created by the guest. The guest does
> not (directly) create guest-physical translations, only linear and
> combined translations.

That's not accurate either.  When L1 is using nested TDP, it does create guest-
physical translations.  The lack of any form of handling in TLB_FLUSH_GUEST is
a reflection of two things: EPT is weird, and nested SVM doesn't yet support
precise flushing on transitions, i.e. nested NPT handling is missing because KVM
unconditionally flushes and synchronizes.

EPT is "weird" because the _only_ time guest-physical translations are flushed
is when the "wrong" KVM MMU is loaded.  The only way to flush guest-physical
translations (short of RESET :-D) is via INVEPT, and INVEPT is a root-only (VMX
terminology) instruction, i.e. can only be executed by L1.  And because L1 can't
itself be using EPT[*], INVEPT can never target/flush the current context.

Furthermore, INVEPT isn't strictly tied to a VMCS, e.g. deferring the emulated
flush until the next time KVM runs a vmcs12 isn't viable.  Rather than add
dedicated tracking, KVM simply unloads the roots and lets the normal root
"allocation" handle the flush+sync the next time the vCPU uses the associated MMU.

Nested NPT is different, as there is no INVNPT.  Instead, there's the ASID itself
and a flushing control, both of which are properties of the VMCB.  As a result,
NPT TLB flushes that are initiated by a hypervisor always take effect at VMRUN,
e.g. by bumping the ASID, or via the dedicated flushing control.
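
For reference, the flushing control referenced here is the VMCB's tlb_ctl
field; its architectural values (as defined in arch/x86/include/asm/svm.h,
if memory serves) are:

  #define TLB_CONTROL_DO_NOTHING		0
  #define TLB_CONTROL_FLUSH_ALL_ASID	1
  #define TLB_CONTROL_FLUSH_ASID		3
  #define TLB_CONTROL_FLUSH_ASID_LOCAL	7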

So when proper handling of TLB flushing on nested SVM transitions comes along, I
do expect that either kvm_vcpu_flush_tlb_guest() will grow, or maybe we'll add
yet another TLB_FLUSH_XXX flavor :-)

One thing that could be helpful would be to document that KVM doesn't use
TLB_FLUSH_GUEST to handle INVEPT, and so there's no need to sync nested TDP MMUs.

[*] Even in a deprivileged scenario like pKVM, the guest kernel would become L2
    from KVM's perspective.
Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
Posted by Yosry Ahmed 1 week, 6 days ago
On Fri, Jan 17, 2025 at 4:03 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 17, 2025, Yosry Ahmed wrote:
> > On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Yep.  I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> > > TLB_FLUSH_CURRENT.  I'm not entirely sure where it would be best to document
> > > them.  I guess maybe where they are #defined?
> >
> > I guess at the #define we can just mention that they result in calling
> > kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
> > anything.
>
> Yeah, a "See xx for details" redirect is probably the best option.
>
> > The specific documentation about what they do could be above the
> > functions themselves, and describing the potential MMU sync is
> > naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
> > already there).
> >
> > The flush_tlb_guest() callback is documented in kvm_host.h, but not
> > flush_tlb_current(). I was going to suggest just documenting that. But
> > kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
> > it also potentially synchronizes the MMU. So only documenting the
> > callbacks does not paint a full picture.
> >
> > FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
> > functions are more-or-less thin wrappers around the per-vendor
> > callbacks -- except kvm_vcpu_flush_tlb_guest().
> >
> > >
> > > TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> > > perspective, is architecturally required.  The one oddity with TLB_FLUSH_GUEST
> > > is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> > > associated with an EPT root.
> >
> > The way I think about this is how it's documented above the per-vendor
> > callback. It flushes translations created by the guest. The guest does
> > not (directly) create guest-physical translations, only linear and
> > combined translations.
>
> That's not accurate either.  When L1 is using nested TDP, it does create guest-
> physical translations.

Ah yes, I missed that.

> The lack of any form of handling in TLB_FLUSH_GUEST is
> a reflection of two things: EPT is weird, and nested SVM doesn't yet support
> precise flushing on transitions, i.e. nested NPT handling is missing because KVM
> unconditionally flushes and synchronizes.

Even for nested NPT, we shouldn't flush/synchronize the guest-physical
mappings unless L1 specifies that in VMCB12. So I suspect
TLB_FLUSH_GUEST will remain ignorant of nested guest-physical mappings
anyway.

>
> EPT is "weird" because the _only_ time guest-physical translations are flushed
> is when the "wrong" KVM MMU is loaded.  The only way to flush guest-physical
> translations (short of RESET :-D) is via INVEPT, and INVEPT is a root-only (VMX
> terminology) instruction, i.e. can only be executed by L1.  And because L1 can't
itself be using EPT[*], INVEPT can never target/flush the current context.

I think you meant L0 here.

>
> Furthermore, INVEPT isn't strictly tied to a VMCS, e.g. deferring the emulated
> flush until the next time KVM runs a vmcs12 isn't viable. Rather than add
> dedicated tracking, KVM simply unloads the roots and lets the normal root
> "allocation" handle the flush+sync the next time the vCPU uses the associated MMU.

I think you are referring to what you mentioned to me offline, that
INVEPT could be affecting more than one cached root in KVM's MMU, not
just the loaded root when running VMCS12, which is why we cannot defer
synchronizing or unloading the roots.

>
> Nested NPT is different, as there is no INVNPT.  Instead, there's the ASID itself
> and a flushing control, both of which are properties of the VMCB.  As a result,
> NPT TLB flushes that are initiated by a hypervisor always take effect at VMRUN,
> e.g. by bumping the ASID, or via the dedicated flushing control.
>
> So when proper handling of TLB flushing on nested SVM transition comes along, I
> do expect that either kvm_vcpu_flush_tlb_guest() will grow.  Or maybe we'll add
> yet another TLB_FLUSH_XXX flavor :-)

I think kvm_vcpu_flush_tlb_guest() should remain ignorant because it's
used for a lot of operations flushing the TLB on behalf of the guest,
which should not flush nested guest-physical translations. Anyway, I
am trying to come up with a PoC for the "proper handling", so we'll
see how that turns out :)

>
> One thing that could be helpful would be to document that KVM doesn't use
> TLB_FLUSH_GUEST to handle INVEPT, and so there's no need to sync nested TDP MMUs.

I think what to document may become clearer when the nSVM
implementation comes along.

>
> [*] Even in a deprivileged scenario like pKVM, the guest kernel would become L2
>     from KVM's perspective.