nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
flush the TLB if VPID is enabled for both L1 and L2, but they still
share the TLB tag. This happens if EPT is disabled and KVM fails to
allocate a VPID for L2, so both the EPTP and VPID are shared between L1
and L2.
Interestingly, nested_vmx_transition_tlb_flush() uses
KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
flush is required.
Taking a close look at vmx_flush_tlb_guest() and
vmx_flush_tlb_current(), the main differences are:
(a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
(b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
INVVPID) to flush the guest-physical mappings as well as combined
mappings.
The check in (a) is seemingly an optimization, and there should not be
any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
check in vmx_flush_tlb_guest() is not a fundamental difference, and it
can be added there separately if needed.
The difference in (b) is irrelevant in this case, because EPT being
enabled for L1 means that its TLB tags are tagged with EPTP and cannot
be used by L2 (regardless of whether or not EPT is enabled for L2).
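
For reference, the two callbacks being compared look roughly like the
sketch below (paraphrased from memory of arch/x86/kvm/vmx/vmx.c rather
than copied from a specific tree, so details may differ); it illustrates
both difference (a), the early return for an invalid root, and
difference (b), INVEPT vs. INVVPID:

static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	u64 root_hpa = mmu->root.hpa;

	/* (a) No flush is needed if the current KVM MMU is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	/* (b) With EPT, also flush guest-physical and combined mappings. */
	if (enable_ept)
		ept_sync_context(construct_eptp(vcpu, root_hpa,
						mmu->root_role.level));
	else
		vpid_sync_context(vmx_get_current_vpid(vcpu));
}

static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
	/*
	 * Flush only guest-generated (linear and combined) mappings.  This
	 * is a nop if the vCPU has no VPID, in which case VM-Enter/VM-Exit
	 * already flush the TLB architecturally.
	 */
	vpid_sync_context(vmx_get_current_vpid(vcpu));
}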
Use KVM_REQ_TLB_FLUSH_GUEST in this case in
nested_vmx_transition_tlb_flush() for consistency. This arguably makes
more sense conceptually too -- L1 and L2 cannot share the TLB tag for
guest-physical translations, so only flushing linear and combined
translations (i.e. guest-generated translations) is needed.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
I tested this by running all selftests that have "nested" in their name
(and not svm). I was tempted to run KVM-unit-tests in an L1 guest but I
convinced myself it's not warranted by the change :)
---
arch/x86/kvm/vmx/nested.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index aa78b6f38dfef..2ed454186e59c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1241,7 +1241,7 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
 	 * as the effective ASID is common to both L1 and L2.
 	 */
 	if (!nested_has_guest_tlb_tag(vcpu))
-		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 
 static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
--
2.48.0.rc2.279.g1de40edade-goog
On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
> flush the TLB if VPID is enabled for both L1 and L2, but they still
> share the TLB tag. This happens if EPT is disabled and KVM fails to
> allocate a VPID for L2, so both the EPTP and VPID are shared between L1
> and L2.

Nit: Combined and guest-physical TLB tags are based on EPTRTA (the new
acronym for EP4TA), not EPTP. But, in any case, with EPT disabled, there
are no combined or guest-physical mappings. There are only linear
mappings.

> Interestingly, nested_vmx_transition_tlb_flush() uses
> KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
> flush is required.
>
> Taking a close look at vmx_flush_tlb_guest() and
> vmx_flush_tlb_current(), the main differences are:
> (a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
> (b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
> INVVPID) to flush the guest-physical mappings as well as combined
> mappings.
>
> The check in (a) is seemingly an optimization, and there should not be
> any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
> check in vmx_flush_tlb_guest() is not a fundamental difference, and it
> can be added there separately if needed.
>
> The difference in (b) is irrelevant in this case, because EPT being
> enabled for L1 means that its TLB tags are tagged with EPTP and cannot
> be used by L2 (regardless of whether or not EPT is enabled for L2).

The difference is also irrelevant because, as you concluded in the first
paragraph, EPT is disabled in the final block of
nested_vmx_transition_tlb_flush().

> Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> guest-physical translations, so only flushing linear and combined
> translations (i.e. guest-generated translations) is needed.

And, as I mentioned above, with EPT disabled, there are no combined or
guest-physical mappings.

> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

I think the reasoning in the commit message can be cleared up a bit, but...

Reviewed-by: Jim Mattson <mattson@google.com>

> ---
>
> I tested this by running all selftests that have "nested" in their name
> (and not svm). I was tempted to run KVM-unit-tests in an L1 guest but I
> convinced myself it's not warranted by the change :)
>
> ---
>  arch/x86/kvm/vmx/nested.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index aa78b6f38dfef..2ed454186e59c 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1241,7 +1241,7 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>  	 * as the effective ASID is common to both L1 and L2.
>  	 */
>  	if (!nested_has_guest_tlb_tag(vcpu))
> -		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  }
>
>  static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
> --
> 2.48.0.rc2.279.g1de40edade-goog
>
>
On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > nested_vmx_transition_tlb_flush() uses KVM_REQ_TLB_FLUSH_CURRENT to
> > flush the TLB if VPID is enabled for both L1 and L2, but they still
> > share the TLB tag. This happens if EPT is disabled and KVM fails to
> > allocate a VPID for L2, so both the EPTP and VPID are shared between L1
> > and L2.
>
> Nit: Combined and guest-physical TLB tags are based on EPTRTA (the new
> acronym for EP4TA), not EPTP. But, in any case, with EPT disabled,
> there are no combined or guest-physical mappings. There are only
> linear mappings.

Interestingly, I did initially write EPTRTA, but I changed it to EPTP
because that is the terminology used in nested_has_guest_tlb_tag().
Anyway, I definitely don't mind changing it to EPTRTA.

>
> > Interestingly, nested_vmx_transition_tlb_flush() uses
> > KVM_REQ_TLB_FLUSH_GUEST to flush the TLB for all other cases where a
> > flush is required.
> >
> > Taking a close look at vmx_flush_tlb_guest() and
> > vmx_flush_tlb_current(), the main differences are:
> > (a) vmx_flush_tlb_current() is a noop if the KVM MMU is invalid.
> > (b) vmx_flush_tlb_current() uses INVEPT if EPT is enabled (instead of
> > INVVPID) to flush the guest-physical mappings as well as combined
> > mappings.
> >
> > The check in (a) is seemingly an optimization, and there should not be
> > any TLB entries for L1 anyway if the KVM MMU is invalid. Not having this
> > check in vmx_flush_tlb_guest() is not a fundamental difference, and it
> > can be added there separately if needed.
> >
> > The difference in (b) is irrelevant in this case, because EPT being
> > enabled for L1 means that its TLB tags are tagged with EPTP and cannot
> > be used by L2 (regardless of whether or not EPT is enabled for L2).
>
> The difference is also irrelevant because, as you concluded in the
> first paragraph, EPT is disabled in the final block of
> nested_vmx_transition_tlb_flush().

I was trying to explain that even if EPT is enabled, sharing
guest-physical translations between L1 and L2 should never be possible
(and hence we should never worry about flushing these translations in
nested_vmx_transition_tlb_flush()). Now that I read it again it is not
as clear as I had hoped.

>
> > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > guest-physical translations, so only flushing linear and combined
> > translations (i.e. guest-generated translations) is needed.
>
> And, as I mentioned above, with EPT disabled, there are no combined or
> guest-physical mappings.
>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> I think the reasoning in the commit message can be cleared up a bit, but...

Agreed :) I am sure Sean will also want changes in the commit message
anyway.

>
> Reviewed-by: Jim Mattson <mattson@google.com>

Thanks for the quick review!
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > guest-physical translations, so only flushing linear and combined
> > > translations (i.e. guest-generated translations) is needed.

No, using KVM_REQ_TLB_FLUSH_CURRENT is correct. From *L1's* perspective, VPID
is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
TLBs, and thus KVM is not required to FLUSH_GUEST.

E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
guaranteed to be visible to L2.

But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
hardware TLBs are flushed. Without EPT, KVM will use different CR3s for L1 and
L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.

Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
is not required in this scenario.

static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.tlb_flush;

	if (!tdp_enabled) {
		/*
		 * A TLB flush on behalf of the guest is equivalent to
		 * INVPCID(all), toggling CR4.PGE, etc., which requires
		 * a forced sync of the shadow page tables.  Ensure all the
		 * roots are synced and the guest TLB in hardware is clean.
		 */
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_sync_prev_roots(vcpu);
	}

	kvm_x86_call(flush_tlb_guest)(vcpu);

	/*
	 * Flushing all "guest" TLB is always a superset of Hyper-V's fine
	 * grained flushing.
	 */
	kvm_hv_vcpu_purge_flush_tlb(vcpu);
}
On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > guest-physical translations, so only flushing linear and combined
> > > > translations (i.e. guest-generated translations) is needed.
>
> No, using KVM_REQ_TLB_FLUSH_CURRENT is correct. From *L1's* perspective, VPID
> is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> TLBs, and thus KVM is not required to FLUSH_GUEST.
>
> E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> guaranteed to be visible to L2.
>
> But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> hardware TLBs are flushed. Without EPT, KVM will use different CR3s for L1 and
> L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
>
> Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> is not required in this scenario.

Aha, I was examining vmx_flush_tlb_guest() not kvm_vcpu_flush_tlb_guest(),
so I missed the synchronization. Yeah I think it's possible that we end up
unnecessarily synchronizing the shadow page tables (or dropping them) in
this case.

Do you think it's worth expanding the comment in
nested_vmx_transition_tlb_flush()?

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2ed454186e59c..43d34e413d016 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
 	 * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
 	 * KVM was unable to allocate a VPID for L2, flush the current context
 	 * as the effective ASID is common to both L1 and L2.
+	 *
+	 * Note that even though TLB_FLUSH_GUEST would be correct because we
+	 * only need to flush linear mappings, it would unnecessarily
+	 * synchronize the MMU even though a TLB flush is not architecturally
+	 * required from L1's perspective.
 	 */
 	if (!nested_has_guest_tlb_tag(vcpu))
 		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > > guest-physical translations, so only flushing linear and combined
> > > > > translations (i.e. guest-generated translations) is needed.
> >
> > No, using KVM_REQ_TLB_FLUSH_CURRENT is correct. From *L1's* perspective, VPID
> > is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> > TLBs, and thus KVM is not required to FLUSH_GUEST.
> >
> > E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> > PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> > to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> > guaranteed to be visible to L2.
> >
> > But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> > hardware TLBs are flushed. Without EPT, KVM will use different CR3s for L1 and
> > L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> > KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
> >
> > Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> > is not required in this scenario.
>
> Aha, I was examining vmx_flush_tlb_guest() not
> kvm_vcpu_flush_tlb_guest(), so I missed the synchronization. Yeah I
> think it's possible that we end up unnecessarily synchronizing the
> shadow page tables (or dropping them) in this case.
>
> Do you think it's worth expanding the comment in
> nested_vmx_transition_tlb_flush()?
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2ed454186e59c..43d34e413d016 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>  	 * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
>  	 * KVM was unable to allocate a VPID for L2, flush the current context
>  	 * as the effective ASID is common to both L1 and L2.
> +	 *
> +	 * Note that even though TLB_FLUSH_GUEST would be correct because we
> +	 * only need to flush linear mappings, it would unnecessarily
> +	 * synchronize the MMU even though a TLB flush is not architecturally
> +	 * required from L1's perspective.

I'm open to calling out that there's no flush from L1's perspective, but this
is inaccurate. Using TLB_FLUSH_GUEST is simply not correct. Will it cause
functional problems? No. But neither would blasting kvm_flush_remote_tlbs(),
and I think most people would consider flushing all TLBs on all vCPUs to be a
bug.

How about:

	 * Note, only the hardware TLB entries need to be flushed, as VPID is
	 * fully enabled from L1's perspective, i.e. there's no architectural
	 * TLB flush from L1's perspective.

>  	 */
>  	if (!nested_has_guest_tlb_tag(vcpu))
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Thu, Jan 16, 2025 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > > On Wed, Jan 15, 2025 at 9:27 PM Jim Mattson <jmattson@google.com> wrote:
> > > > > On Wed, Jan 15, 2025 at 7:50 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > Use KVM_REQ_TLB_FLUSH_GUEST in this case in
> > > > > > nested_vmx_transition_tlb_flush() for consistency. This arguably makes
> > > > > > more sense conceptually too -- L1 and L2 cannot share the TLB tag for
> > > > > > guest-physical translations, so only flushing linear and combined
> > > > > > translations (i.e. guest-generated translations) is needed.
> > >
> > > No, using KVM_REQ_TLB_FLUSH_CURRENT is correct. From *L1's* perspective, VPID
> > > is enabled, and so VM-Entry/VM-Exit are NOT architecturally guaranteed to flush
> > > TLBs, and thus KVM is not required to FLUSH_GUEST.
> > >
> > > E.g. if KVM is using shadow paging (no EPT whatsoever), and L1 has modified the
> > > PTEs used to map L2 but has not yet flushed TLBs for L2's VPID, then KVM is allowed
> > > to retain its old, "stale" SPTEs that map L2 because architecturally they aren't
> > > guaranteed to be visible to L2.
> > >
> > > But because L1 and L2 share TLB entries *in hardware*, KVM needs to ensure the
> > > hardware TLBs are flushed. Without EPT, KVM will use different CR3s for L1 and
> > > L2, but Intel's ASID tag doesn't include the CR3 address, only the PCID, which
> > > KVM always pulls from guest CR3, i.e. could be the same for L1 and L2.
> > >
> > > Specifically, the synchronization of shadow roots in kvm_vcpu_flush_tlb_guest()
> > > is not required in this scenario.
> >
> > Aha, I was examining vmx_flush_tlb_guest() not
> > kvm_vcpu_flush_tlb_guest(), so I missed the synchronization. Yeah I
> > think it's possible that we end up unnecessarily synchronizing the
> > shadow page tables (or dropping them) in this case.
> >
> > Do you think it's worth expanding the comment in
> > nested_vmx_transition_tlb_flush()?
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 2ed454186e59c..43d34e413d016 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
> >  	 * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
> >  	 * KVM was unable to allocate a VPID for L2, flush the current context
> >  	 * as the effective ASID is common to both L1 and L2.
> > +	 *
> > +	 * Note that even though TLB_FLUSH_GUEST would be correct because we
> > +	 * only need to flush linear mappings, it would unnecessarily
> > +	 * synchronize the MMU even though a TLB flush is not architecturally
> > +	 * required from L1's perspective.
>
> I'm open to calling out that there's no flush from L1's perspective, but this
> is inaccurate. Using TLB_FLUSH_GUEST is simply not correct. Will it cause
> functional problems? No. But neither would blasting kvm_flush_remote_tlbs(),
> and I think most people would consider flushing all TLBs on all vCPUs to be a
> bug.

Yeah I meant functionally correct as it does not cause correctness
issues, but definitely a problem.

>
> How about:
>
> 	 * Note, only the hardware TLB entries need to be flushed, as VPID is
> 	 * fully enabled from L1's perspective, i.e. there's no architectural
> 	 * TLB flush from L1's perspective.

I hate to bikeshed, but I want to explicitly call out that we do not
need to synchronize the MMU. Maybe this?

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2ed454186e59c..a9171909de63d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
 	 * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
 	 * KVM was unable to allocate a VPID for L2, flush the current context
 	 * as the effective ASID is common to both L1 and L2.
+	 *
+	 * Note, only the hardware TLB entries need to be flushed, as VPID is
+	 * fully enabled from L1's perspective, i.e. there's no
+	 * architectural TLB flush from L1's perspective. Hence, synchronizing
+	 * the MMU is not required as the mappings are still valid.
 	 */
 	if (!nested_has_guest_tlb_tag(vcpu))
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > How about:
> >
> > 	 * Note, only the hardware TLB entries need to be flushed, as VPID is
> > 	 * fully enabled from L1's perspective, i.e. there's no architectural
> > 	 * TLB flush from L1's perspective.
>
> I hate to bikeshed, but I want to explicitly call out that we do not
> need to synchronize the MMU.

Why? Honest question, I want to understand what's unclear. My hesitation to
talk about synchronizing MMUs is that it brings things into play that aren't
super relevant to this specific code, and might even add confusion. Specifically,
kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
the fact that this path is reachable if and only if EPT is enabled is completely
coincidental.

E.g. very hypothetically, if KVM used the same EPT root (I already forgot Intel's
new acronym) for L1 and L2, then this would no longer be true:

	 * If L0 uses EPT, L1 and L2 run with different EPTP because
	 * guest_mode is part of kvm_mmu_page_role.  Thus, TLB entries
	 * are tagged with different EPTP.

L1 vs. L2 EPT usage would no longer use separate ASID tags, and so KVM would
need to FLUSH_CURRENT on transitions in most cases, e.g. to purge APICv mappings.

The comment above !nested_cpu_has_vpid() talks at length about synchronizing MMUs
because the EPT behavior in particular is subtle and rather unintuitive. I.e.
the comment is much more about NOT using KVM_REQ_MMU_SYNC than it is about using
KVM_REQ_TLB_FLUSH_GUEST.

> Maybe this?
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2ed454186e59c..a9171909de63d 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1239,6 +1239,11 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
>  	 * does not have a unique TLB tag (ASID), i.e. EPT is disabled and
>  	 * KVM was unable to allocate a VPID for L2, flush the current context
>  	 * as the effective ASID is common to both L1 and L2.
> +	 *
> +	 * Note, only the hardware TLB entries need to be flushed, as VPID is
> +	 * fully enabled from L1's perspective, i.e. there's no
> +	 * architectural TLB flush from L1's perspective. Hence, synchronizing
> +	 * the MMU is not required as the mappings are still valid.
>  	 */
>  	if (!nested_has_guest_tlb_tag(vcpu))
>  		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
On Thu, Jan 16, 2025 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > How about:
> > >
> > > 	 * Note, only the hardware TLB entries need to be flushed, as VPID is
> > > 	 * fully enabled from L1's perspective, i.e. there's no architectural
> > > 	 * TLB flush from L1's perspective.
> >
> > I hate to bikeshed, but I want to explicitly call out that we do not
> > need to synchronize the MMU.
>
> Why? Honest question, I want to understand what's unclear. My hesitation to
> talk about synchronizing MMUs is that it brings things into play that aren't
> super relevant to this specific code, and might even add confusion. Specifically,
> kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
> the fact that this path is reachable if and only if EPT is enabled is completely
> coincidental.

Personally, the main thing that was unclear to me and I wanted to add
a comment to clarify was why we use KVM_REQ_TLB_FLUSH_GUEST in the
first two cases but KVM_REQ_TLB_FLUSH_CURRENT in the last one.

Here's my understanding:

In the first case (i.e. !nested_cpu_has_vpid(vmcs12)), the flush is
architecturally required from L1's perspective, so we need to flush
guest-generated TLB entries (and potentially synchronize KVM's MMU).

In the second case, KVM does not track the history of VPID12, so the
flush *may* be architecturally required from L1's perspective, so we
do the same thing.

In the last case though, the flush is NOT architecturally required
from L1's perspective, it's just an artifact of KVM's potential
failure to allocate a dedicated VPID for L2 despite L1 asking for it.

So ultimately, I don't want to specifically call out synchronizing
MMUs, as much as I want to call out why this case uses
KVM_REQ_TLB_FLUSH_CURRENT and not KVM_REQ_TLB_FLUSH_GUEST like the
others. I only suggested calling out the MMU synchronization since
it's effectively the only difference between the two in this case.

I am open to any wording you think is best. I am also fine with just
dropping this completely, definitely not the hill to die on :)

>
> E.g. very hypothetically, if KVM used the same EPT root (I already forgot Intel's
> new acronym) for L1 and L2, then this would no longer be true:
>
> 	 * If L0 uses EPT, L1 and L2 run with different EPTP because
> 	 * guest_mode is part of kvm_mmu_page_role.  Thus, TLB entries
> 	 * are tagged with different EPTP.
>
> L1 vs. L2 EPT usage would no longer use separate ASID tags, and so KVM would
> need to FLUSH_CURRENT on transitions in most cases, e.g. to purge APICv mappings.
>
> The comment above !nested_cpu_has_vpid() talks at length about synchronizing MMUs
> because the EPT behavior in particular is subtle and rather unintuitive. I.e.
> the comment is much more about NOT using KVM_REQ_MMU_SYNC than it is about using
> KVM_REQ_TLB_FLUSH_GUEST.
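
To map the three cases above onto the code, here is a heavily condensed
sketch of nested_vmx_transition_tlb_flush() (comments elided and details
recalled from memory, so it may not match any particular tree exactly):

static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu,
					    struct vmcs12 *vmcs12,
					    bool is_vmenter)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* Case 1: vmcs12 doesn't enable VPID, so VM-Enter/VM-Exit is an
	 * architectural TLB flush from L1's (and L2's) perspective. */
	if (!nested_cpu_has_vpid(vmcs12)) {
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
		return;
	}

	/* Case 2: vpid12 changed and KVM doesn't track vpid12 history, so
	 * conservatively emulate the architectural flush L1 may expect. */
	if (is_vmenter &&
	    vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
		vmx->nested.last_vpid = vmcs12->virtual_processor_id;
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
	}

	/* Case 3: no architectural flush, but L1 and L2 share a hardware
	 * ASID (EPT disabled and no VPID allocated for L2), so the hardware
	 * TLB must be flushed. */
	if (!nested_has_guest_tlb_tag(vcpu))
		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
}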
On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> On Thu, Jan 16, 2025 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > How about:
> > > >
> > > > 	 * Note, only the hardware TLB entries need to be flushed, as VPID is
> > > > 	 * fully enabled from L1's perspective, i.e. there's no architectural
> > > > 	 * TLB flush from L1's perspective.
> > >
> > > I hate to bikeshed, but I want to explicitly call out that we do not
> > > need to synchronize the MMU.
> >
> > Why? Honest question, I want to understand what's unclear. My hesitation to
> > talk about synchronizing MMUs is that it brings things into play that aren't
> > super relevant to this specific code, and might even add confusion. Specifically,
> > kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
> > the fact that this path is reachable if and only if EPT is enabled is completely
> > coincidental.
>
> Personally, the main thing that was unclear to me and I wanted to add
> a comment to clarify was why we use KVM_REQ_TLB_FLUSH_GUEST in the
> first two cases but KVM_REQ_TLB_FLUSH_CURRENT in the last one.
>
> Here's my understanding:
>
> In the first case (i.e. !nested_cpu_has_vpid(vmcs12)), the flush is
> architecturally required from L1's perspective, so we need to flush
> guest-generated TLB entries (and potentially synchronize KVM's MMU).
>
> In the second case, KVM does not track the history of VPID12, so the
> flush *may* be architecturally required from L1's perspective, so we
> do the same thing.
>
> In the last case though, the flush is NOT architecturally required
> from L1's perspective, it's just an artifact of KVM's potential
> failure to allocate a dedicated VPID for L2 despite L1 asking for it.
>
> So ultimately, I don't want to specifically call out synchronizing
> MMUs, as much as I want to call out why this case uses
> KVM_REQ_TLB_FLUSH_CURRENT and not KVM_REQ_TLB_FLUSH_GUEST like the
> others. I only suggested calling out the MMU synchronization since
> it's effectively the only difference between the two in this case.

Yep. I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
TLB_FLUSH_CURRENT. I'm not entirely sure where it would be best to document
them. I guess maybe where they are #defined?

TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
perspective, is architecturally required. The one oddity with TLB_FLUSH_GUEST
is that it does NOT include guest-physical mappings, i.e. TLB entries that are
associated with an EPT root.

TLB_FLUSH_CURRENT is used when KVM needs to flush the hardware TLB(s) for the
current CPU+context. The most "obvious" case is for when KVM has modified its
page tables. More subtle cases are things like changing which physical CPU the
vCPU is running on, and this case where KVM is switching the shadow CR3, VPID is
enabled in the host (i.e. hardware won't flush TLBs), and the L1 vs. L2 shadow
CR3s are not tagged (i.e. use the same hardware VPID).

The use of TLB_FLUSH_GUEST *may* result in an MMU sync, but that's a side effect
of an architectural guest TLB flush occurring, not the other way around.
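
If the requests did grow comments where they are #defined in
arch/x86/include/asm/kvm_host.h, the wording might look something like
this (illustrative only, not an actual patch; the KVM_ARCH_REQ() numbers
are intentionally not shown because they are tree-specific):

/*
 * KVM_REQ_TLB_FLUSH_GUEST: an architectural TLB flush from the guest's
 * perspective is required, e.g. the guest executed INVPCID(all) or
 * toggled CR4.PGE.  Flushes linear and combined mappings, but NOT
 * guest-physical (EPT) mappings, and may force a sync of shadow roots
 * when TDP is disabled.  See kvm_vcpu_flush_tlb_guest() for details.
 *
 * KVM_REQ_TLB_FLUSH_CURRENT: KVM needs to flush the hardware TLB for
 * the current CPU+context, e.g. KVM modified its own page tables, the
 * vCPU migrated to a new pCPU, or L1 and L2 share a hardware ASID
 * across a shadow CR3 switch.  No architectural flush is implied from
 * the guest's perspective.  See kvm_vcpu_flush_tlb_current() for
 * details.
 */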
On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > On Thu, Jan 16, 2025 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 16, 2025, Yosry Ahmed wrote:
> > > > On Thu, Jan 16, 2025 at 2:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > How about:
> > > > >
> > > > > 	 * Note, only the hardware TLB entries need to be flushed, as VPID is
> > > > > 	 * fully enabled from L1's perspective, i.e. there's no architectural
> > > > > 	 * TLB flush from L1's perspective.
> > > >
> > > > I hate to bikeshed, but I want to explicitly call out that we do not
> > > > need to synchronize the MMU.
> > >
> > > Why? Honest question, I want to understand what's unclear. My hesitation to
> > > talk about synchronizing MMUs is that it brings things into play that aren't
> > > super relevant to this specific code, and might even add confusion. Specifically,
> > > kvm_vcpu_flush_tlb_guest() does NOT synchronize MMUs when EPT/TDP is enabled, but
> > > the fact that this path is reachable if and only if EPT is enabled is completely
> > > coincidental.
> >
> > Personally, the main thing that was unclear to me and I wanted to add
> > a comment to clarify was why we use KVM_REQ_TLB_FLUSH_GUEST in the
> > first two cases but KVM_REQ_TLB_FLUSH_CURRENT in the last one.
> >
> > Here's my understanding:
> >
> > In the first case (i.e. !nested_cpu_has_vpid(vmcs12)), the flush is
> > architecturally required from L1's perspective, so we need to flush
> > guest-generated TLB entries (and potentially synchronize KVM's MMU).
> >
> > In the second case, KVM does not track the history of VPID12, so the
> > flush *may* be architecturally required from L1's perspective, so we
> > do the same thing.
> >
> > In the last case though, the flush is NOT architecturally required
> > from L1's perspective, it's just an artifact of KVM's potential
> > failure to allocate a dedicated VPID for L2 despite L1 asking for it.
> >
> > So ultimately, I don't want to specifically call out synchronizing
> > MMUs, as much as I want to call out why this case uses
> > KVM_REQ_TLB_FLUSH_CURRENT and not KVM_REQ_TLB_FLUSH_GUEST like the
> > others. I only suggested calling out the MMU synchronization since
> > it's effectively the only difference between the two in this case.
>
> Yep. I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> TLB_FLUSH_CURRENT. I'm not entirely sure where it would be best to document
> them. I guess maybe where they are #defined?

I guess at the #define we can just mention that they result in calling
kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
anything. The specific documentation about what they do could be above
the functions themselves, and describing the potential MMU sync is
naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda already
there).

The flush_tlb_guest() callback is documented in kvm_host.h, but not
flush_tlb_current(). I was going to suggest just documenting that. But
kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but it
also potentially synchronizes the MMU. So only documenting the callbacks
does not paint a full picture.

FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
functions are more-or-less thin wrappers around the per-vendor
callbacks -- except kvm_vcpu_flush_tlb_guest().

>
> TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> perspective, is architecturally required. The one oddity with TLB_FLUSH_GUEST
> is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> associated with an EPT root.

The way I think about this is how it's documented above the per-vendor
callback. It flushes translations created by the guest. The guest does
not (directly) create guest-physical translations, only linear and
combined translations.

>
> TLB_FLUSH_CURRENT is used when KVM needs to flush the hardware TLB(s) for the
> current CPU+context. The most "obvious" case is for when KVM has modified its
> page tables. More subtle cases are things like changing which physical CPU the
> vCPU is running on, and this case where KVM is switching the shadow CR3, VPID is
> enabled in the host (i.e. hardware won't flush TLBs), and the L1 vs. L2 shadow
> CR3s are not tagged (i.e. use the same hardware VPID).
>
> The use of TLB_FLUSH_GUEST *may* result in an MMU sync, but that's a side effect
> of an architectural guest TLB flush occurring, not the other way around.

Agreed.
On Fri, Jan 17, 2025, Yosry Ahmed wrote:
> On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
> > Yep. I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> > TLB_FLUSH_CURRENT. I'm not entirely sure where it would be best to document
> > them. I guess maybe where they are #defined?
>
> I guess at the #define we can just mention that they result in calling
> kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
> anything.

Yeah, a "See xx for details" redirect is probably the best option.

> The specific documentation about what they do could be above the
> functions themselves, and describing the potential MMU sync is
> naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
> already there).
>
> The flush_tlb_guest() callback is documented in kvm_host.h, but not
> flush_tlb_current(). I was going to suggest just documenting that. But
> kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
> it also potentially synchronizes the MMU. So only documenting the
> callbacks does not paint a full picture.
>
> FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
> functions are more-or-less thin wrappers around the per-vendor
> callbacks -- except kvm_vcpu_flush_tlb_guest().
>
> > TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> > perspective, is architecturally required. The one oddity with TLB_FLUSH_GUEST
> > is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> > associated with an EPT root.
>
> The way I think about this is how it's documented above the per-vendor
> callback. It flushes translations created by the guest. The guest does
> not (directly) create guest-physical translations, only linear and
> combined translations.

That's not accurate either. When L1 is using nested TDP, it does create
guest-physical translations. The lack of any form of handling in
TLB_FLUSH_GUEST is a reflection of two things: EPT is weird, and nested SVM
doesn't yet support precise flushing on transitions, i.e. nested NPT handling
is missing because KVM unconditionally flushes and synchronizes.

EPT is "weird" because the _only_ time guest-physical translations are flushed
is when the "wrong" KVM MMU is loaded. The only way to flush guest-physical
translations (short of RESET :-D) is via INVEPT, and INVEPT is a root-only (VMX
terminology) instruction, i.e. can only be executed by L1. And because L1 can't
itself be using EPT[*], INVEPT can never target/flush the current context.

Furthermore, INVEPT isn't strictly tied to a VMCS, e.g. deferring the emulated
flush until the next time KVM runs a vmcs12 isn't viable. Rather than add
dedicated tracking, KVM simply unloads the roots and lets the normal root
"allocation" handle the flush+sync the next time the vCPU uses the associated MMU.

Nested NPT is different, as there is no INVNPT. Instead, there's the ASID itself
and a flushing control, both of which are properties of the VMCB. As a result,
NPT TLB flushes that are initiated by a hypervisor always take effect at VMRUN,
e.g. by bumping the ASID, or via the dedicated flushing control.

So when proper handling of TLB flushing on nested SVM transition comes along, I
do expect that either kvm_vcpu_flush_tlb_guest() will grow. Or maybe we'll add
yet another TLB_FLUSH_XXX flavor :-)

One thing that could be helpful would be to document that KVM doesn't use
TLB_FLUSH_GUEST to handle INVEPT, and so there's no need to sync nested TDP MMUs.

[*] Even in a deprivileged scenario like pKVM, the guest kernel would become L2
from KVM's perspective.
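
For anyone unfamiliar with the "unload the roots" approach, a rough sketch
of the context-invalidation handling in handle_invept() follows (recalled
from memory; exact helper names and signatures may differ from the current
tree):

	/* Emulated INVEPT: no dedicated TLB tracking, just unload any roots
	 * that match the invalidated EPTP and let the next root allocation
	 * perform the flush + sync. */
	switch (type) {
	case VMX_EPT_EXTENT_CONTEXT:
		if (nested_ept_root_matches(mmu->root.hpa, mmu->root.pgd,
					    operand.eptp))
			roots_to_free |= KVM_MMU_ROOT_CURRENT;

		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
			if (nested_ept_root_matches(mmu->prev_roots[i].hpa,
						    mmu->prev_roots[i].pgd,
						    operand.eptp))
				roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
		}
		break;
	case VMX_EPT_EXTENT_GLOBAL:
		roots_to_free = KVM_MMU_ROOTS_ALL;
		break;
	}

	if (roots_to_free)
		kvm_mmu_free_roots(vcpu->kvm, mmu, roots_to_free);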
On Fri, Jan 17, 2025 at 4:03 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 17, 2025, Yosry Ahmed wrote:
> > On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Yep. I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> > > TLB_FLUSH_CURRENT. I'm not entirely sure where it would be best to document
> > > them. I guess maybe where they are #defined?
> >
> > I guess at the #define we can just mention that they result in calling
> > kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
> > anything.
>
> Yeah, a "See xx for details" redirect is probably the best option.
>
> > The specific documentation about what they do could be above the
> > functions themselves, and describing the potential MMU sync is
> > naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
> > already there).
> >
> > The flush_tlb_guest() callback is documented in kvm_host.h, but not
> > flush_tlb_current(). I was going to suggest just documenting that. But
> > kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
> > it also potentially synchronizes the MMU. So only documenting the
> > callbacks does not paint a full picture.
> >
> > FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
> > functions are more-or-less thin wrappers around the per-vendor
> > callbacks -- except kvm_vcpu_flush_tlb_guest().
> >
> > > TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> > > perspective, is architecturally required. The one oddity with TLB_FLUSH_GUEST
> > > is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> > > associated with an EPT root.
> >
> > The way I think about this is how it's documented above the per-vendor
> > callback. It flushes translations created by the guest. The guest does
> > not (directly) create guest-physical translations, only linear and
> > combined translations.
>
> That's not accurate either. When L1 is using nested TDP, it does create
> guest-physical translations.

Ah yes, I missed that.

> The lack of any form of handling in TLB_FLUSH_GUEST is
> a reflection of two things: EPT is weird, and nested SVM doesn't yet support
> precise flushing on transitions, i.e. nested NPT handling is missing because KVM
> unconditionally flushes and synchronizes.

Even for nested NPT, we shouldn't flush/synchronize the guest-physical
mappings unless L1 specifies that in VMCB12. So I suspect TLB_FLUSH_GUEST
will remain ignorant to nested guest-physical mappings anyway.

>
> EPT is "weird" because the _only_ time guest-physical translations are flushed
> is when the "wrong" KVM MMU is loaded. The only way to flush guest-physical
> translations (short of RESET :-D) is via INVEPT, and INVEPT is a root-only (VMX
> terminology) instruction, i.e. can only be executed by L1. And because L1 can't
> itself be using EPT[*], INVEPT can never target/flush the current context.

I think you meant L0 here.

>
> Furthermore, INVEPT isn't strictly tied to a VMCS, e.g. deferring the emulated
> flush until the next time KVM runs a vmcs12 isn't viable. Rather than add
> dedicated tracking, KVM simply unloads the roots and lets the normal root
> "allocation" handle the flush+sync the next time the vCPU uses the associated MMU.

I think you are referring to what you mentioned to me offline, that
INVEPT could be affecting more than one cached root in KVM's MMU, not
just the loaded root when running VMCS12, which is why we cannot defer
synchronizing or unloading the roots.

>
> Nested NPT is different, as there is no INVNPT. Instead, there's the ASID itself
> and a flushing control, both of which are properties of the VMCB. As a result,
> NPT TLB flushes that are initiated by a hypervisor always take effect at VMRUN,
> e.g. by bumping the ASID, or via the dedicated flushing control.
>
> So when proper handling of TLB flushing on nested SVM transition comes along, I
> do expect that either kvm_vcpu_flush_tlb_guest() will grow. Or maybe we'll add
> yet another TLB_FLUSH_XXX flavor :-)

I think kvm_vcpu_flush_tlb_guest() should remain ignorant because it's
used for a lot of operations flushing the TLB on behalf of the guest,
which should not flush nested guest-physical translations.

Anyway, I am trying to come up with a PoC for the "proper handling", so
we'll see how that turns out :)

>
> One thing that could be helpful would be to document that KVM doesn't use
> TLB_FLUSH_GUEST to handle INVEPT, and so there's no need to sync nested TDP MMUs.

I think what to document may become clearer when the nSVM implementation
comes along.

>
> [*] Even in a deprivileged scenario like pKVM, the guest kernel would become L2
> from KVM's perspective.