TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING on nested transitions to
L2 in nested_vmcb02_prepare_control(). This is unnecessary because:
- TLB_CONTROL is set in vcpu_enter_guest() if needed when servicing a
TLB flush request (by svm_flush_tlb_asid()).
- TLB_CONTROL is reset to TLB_CONTROL_DO_NOTHING after the guest is run
  in svm_vcpu_run().
Hence, at the point where nested_vmcb02_prepare_control() is called,
TLB_CONTROL should have already been set to TLB_CONTROL_DO_NOTHING by
svm_vcpu_run() after L1 runs.
There is a TODO in nested_svm_entry_tlb_flush() about this reset
crushing pending TLB flushes. Remove it, as the reset is not really
crushing anything as explained above.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ca8db246ac050..544913461693c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -511,12 +511,7 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
svm->nested.last_asid = svm->nested.ctl.asid;
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}
- /*
- * TODO: optimize unconditional TLB flush/MMU sync. A partial list of
- * things to fix before this can be conditional:
- *
- * - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
- */
+ /* TODO: optimize unconditional TLB flush/MMU sync */
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
}
@@ -710,9 +705,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
vmcb02->control.asid = svm_nested_asid(vcpu->kvm);
- /* Also overwritten later if necessary. */
- vmcb_clr_flush_asid(vmcb02);
-
/* nested_cr3. */
if (nested_npt_enabled(svm))
nested_svm_init_mmu_context(vcpu);
--
2.49.0.395.g12beb8f557-goog
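For reference, a minimal sketch of the TLB_CONTROL lifecycle described above.
svm_flush_tlb_asid() is the KVM helper named in the commit message, while the
post-run reset helper below is only a stand-in for the code in svm_vcpu_run();
the exact upstream code differs in details (e.g. the FLUSHBYASID-less fallback
is omitted here):

/*
 * Minimal sketch, not part of the patch. Loosely follows the behavior of
 * the real helpers; details such as the FLUSHBYASID fallback are omitted.
 */
static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/* Ask hardware to flush the current ASID on the next VMRUN. */
	svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
}

/* Stand-in for the reset done in svm_vcpu_run() after the guest runs. */
static void svm_reset_tlb_ctl_after_run(struct vcpu_svm *svm)
{
	/* The flush was consumed by VMRUN; re-arm "do nothing". */
	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
}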
KVM does not track TLB flush requests for L1 vs. L2. Hence, service
local flushes that target the current context before switching to a new
one. Since the current ASID is identified through the current VMCB
(nested or not), service the flushes before every VMCB switch.
This is conceptually similar to how nVMX calls
kvm_service_local_tlb_flush_requests() in
nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(), with the
following differences:
1. VMX tracks the current VPID based on is_guest_mode(), so local TLB
flushes are serviced before enter_guest_mode() and
leave_guest_mode(). On the other hand, SVM tracks the current ASID
based on the current VMCB, so the TLB flushes are serviced before a
VMCB switch.
2. nVMX only enters and leaves guest mode in
nested_vmx_enter_non_root_mode() and nested_vmx_vmexit(). Other paths
like vmx_set_nested_state() and vmx_leave_nested() call into these
two functions. On the other hand, nSVM open codes the switch in
functions like svm_set_nested_state() and svm_leave_nested(), so
servicing the flush in svm_switch_vmcb() is probably the most reliable option.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/svm.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3e33ac876eb32..3649707c61d3e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1439,6 +1439,12 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
{
+ /*
+ * The current ASID is identified through the VMCB. Perform any pending
+ * TLB flushes for the current VMCB before switching to a new one.
+ */
+ kvm_service_local_tlb_flush_requests(&svm->vcpu);
+
svm->current_vmcb = target_vmcb;
svm->vmcb = target_vmcb->ptr;
}
--
2.49.0.395.g12beb8f557-goog
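For reference, the common x86 helper called from svm_switch_vmcb() above
services the two "local" flush requests. From memory it looks roughly like the
following (see kvm_service_local_tlb_flush_requests() in arch/x86/kvm/x86.c;
the actual code may differ in detail):

void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu)
{
	/* Flush the current (hardware) context if requested. */
	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
		kvm_vcpu_flush_tlb_current(vcpu);

	/* Flush guest-visible translations, e.g. for INVLPG emulation. */
	if (kvm_check_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu))
		kvm_vcpu_flush_tlb_guest(vcpu);
}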
Currently, INVLPGA interception handles it like INVLPG, which flushes
L1's TLB translations for the address. It was implemented in this way
because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
is still harmless to flush L1's translations, but it's only correct
because all translations are flushed on nested transitions anyway.
In preparation for stopping unconditional flushes on nested transitions,
handle INVLPGA interception properly. If L1 specified zero as the ASID,
this is equivalent to INVLPG, so handle it as such. Otherwise, use
INVLPGA to flush the translations of the appropriate ASID tracked by
KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
mappings.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 5 +++--
arch/x86/kvm/svm/svm.c | 36 +++++++++++++++++++++++++++++++--
3 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d881e7d276b12..a158d324168a0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2237,6 +2237,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
void *insn, int insn_len);
void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
+void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ u64 addr, unsigned long roots, bool gva_flush);
void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
u64 addr, unsigned long roots);
void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2b1994f12753..d3baa12df84e7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6355,8 +6355,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
write_unlock(&vcpu->kvm->mmu_lock);
}
-static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
- u64 addr, unsigned long roots, bool gva_flush)
+void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ u64 addr, unsigned long roots, bool gva_flush)
{
int i;
@@ -6382,6 +6382,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
}
}
+EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
u64 addr, unsigned long roots)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3649707c61d3e..4b95fd6b501e6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2505,6 +2505,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
static int invlpga_interception(struct kvm_vcpu *vcpu)
{
+ struct vcpu_svm *svm = to_svm(vcpu);
gva_t gva = kvm_rax_read(vcpu);
u32 asid = kvm_rcx_read(vcpu);
@@ -2514,8 +2515,39 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
- /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
- kvm_mmu_invlpg(vcpu, gva);
+ /*
+ * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
+ * Do the logical thing and handle it like INVLPG.
+ */
+ if (asid == 0) {
+ kvm_mmu_invlpg(vcpu, gva);
+ return kvm_skip_emulated_instruction(vcpu);
+ }
+
+ /*
+ * Check if L1 specified the L2 ASID we are currently tracking. If it
+ * isn't, do nothing as we have to handle the TLB flush when switching
+ * to the new ASID anyway.
+ */
+ if (asid == svm->nested.last_asid)
+ invlpga(gva, svm_nested_asid(vcpu->kvm));
+
+ /*
+ * If NPT is disabled, sync the shadow page tables as L1 is invalidating
+ * mappings for L2. Sync all roots as ASIDs are not tracked in the MMU
+ * role.
+ *
+ * As we are not flushing the current context, skip the gva flush from
+ * __kvm_mmu_invalidate_addr(), it would flush the wrong ASID anyway.
+ * The correct TLB flush was done above (if needed).
+ *
+ * This always operates on root_mmu because L1 and L2 share an MMU when
+ * NPT is disabled. This can be optimized by invalidating guest roots
+ * only.
+ */
+ if (!npt_enabled)
+ __kvm_mmu_invalidate_addr(vcpu, &vcpu->arch.root_mmu, gva,
+ KVM_MMU_ROOTS_ALL, false);
return kvm_skip_emulated_instruction(vcpu);
}
--
2.49.0.395.g12beb8f557-goog
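For context, a rough approximation of the invlpga() wrapper used above; the
real helper in KVM's SVM headers is built on asm macros and exception fixups,
so take this only as a sketch of the semantics:

/*
 * Rough approximation, not the actual KVM code. INVLPGA invalidates the
 * TLB entry for the virtual address in rAX within the ASID given in ECX,
 * so flushing svm_nested_asid() only touches L2's translations.
 */
static inline void invlpga(unsigned long addr, u32 asid)
{
	asm volatile("invlpga %%rax, %%ecx"
		     : : "a"(addr), "c"(asid) : "memory");
}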
On Wed, Mar 26, 2025, Yosry Ahmed wrote:
> Currently, INVPLGA interception handles it like INVLPG, which flushes
> L1's TLB translations for the address. It was implemented in this way
> because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
> is still harmless to flush L1's translations, but it's only correct
> because all translations are flushed on nested transitions anyway.
>
> In preparation for stopping unconditional flushes on nested transitions,
> handle INVPLGA interception properly. If L1 specified zero as the ASID,
> this is equivalent to INVLPG, so handle it as such. Otherwise, use
> INVPLGA to flush the translations of the appropriate ASID tracked by
> KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
> mappings.
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 5 +++--
> arch/x86/kvm/svm/svm.c | 36 +++++++++++++++++++++++++++++++--
> 3 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d881e7d276b12..a158d324168a0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2237,6 +2237,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> void *insn, int insn_len);
> void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
> void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> + u64 addr, unsigned long roots, bool gva_flush);
> void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> u64 addr, unsigned long roots);
> void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2b1994f12753..d3baa12df84e7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6355,8 +6355,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
> write_unlock(&vcpu->kvm->mmu_lock);
> }
>
> -static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> - u64 addr, unsigned long roots, bool gva_flush)
> +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> + u64 addr, unsigned long roots, bool gva_flush)
I don't love passing a boolean to avoid a flush. I especially don't like it in
this case because vmx_flush_tlb_gva() has similar logic. Unfortunately, I don't
see a better option at this point. :-/
If we do keep the param, it needs to be something like @flush_gva, because I
read @gva_flush as "this is a gva flush", and got all kinds of confused when
reading the code.
> {
> int i;
>
> @@ -6382,6 +6382,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
> kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
> }
> }
> +EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
>
> void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> u64 addr, unsigned long roots)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 3649707c61d3e..4b95fd6b501e6 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2505,6 +2505,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
>
> static int invlpga_interception(struct kvm_vcpu *vcpu)
> {
> + struct vcpu_svm *svm = to_svm(vcpu);
> gva_t gva = kvm_rax_read(vcpu);
> u32 asid = kvm_rcx_read(vcpu);
>
> @@ -2514,8 +2515,39 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
>
> trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
>
> - /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
> - kvm_mmu_invlpg(vcpu, gva);
This code needs to do a noncanonical check (assuming we can't figure out a way
to shoehorn this into kvm_mmu_invlpg()). Consuming gva here for the asid != 0
case might be "fine", because INVLPGA won't fault, but it's still a bug, e.g. I
don't know what will happen when KVM tries to synchronize MMUs.
Another reason I don't love the @flush_gva param :-/
> + /*
> + * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
> + * Do the logical thing and handle it like INVLPG.
> + */
> + if (asid == 0) {
if (!asid)
> + kvm_mmu_invlpg(vcpu, gva);
> + return kvm_skip_emulated_instruction(vcpu);
> + }
> +
> + /*
> + * Check if L1 specified the L2 ASID we are currently tracking. If it
> + * isn't, do nothing as we have to handle the TLB flush when switching
> + * to the new ASID anyway.
> + */
Please avoid pronouns. And try not to allude to behavior; the above doesn't
actually say what happens when switching to a new ASID, only that "we have to
handle the TLB flush". E.g.
/*
* Flush hardware TLB entries only if L1 is flushing KVM's currently
* tracked L2 ASID. KVM does a full TLB flush when L1 runs a VMCB with
* a different L2 ASID.
*/
> + if (asid == svm->nested.last_asid)
> + invlpga(gva, svm_nested_asid(vcpu->kvm));
> +
> + /*
> + * If NPT is disabled, sync the shadow page tables as L1 is invalidating
> + * mappings for L2. Sync all roots as ASIDs are not tracked in the MMU
> + * role.
> + *
> + * As we are not flushing the current context, skip the gva flush from
> + * __kvm_mmu_invalidate_addr(), it would flush the wrong ASID anyway.
> + * The correct TLB flush was done above (if needed).
> + *
> + * This always operates on root_mmu because L1 and L2 share an MMU when
> + * NPT is disabled. This can be optimized by invalidating guest roots
> + * only.
Heh, I had a comment typed up about only needing to sync guest roots, and then I
read this literal comment. :-)
> + */
> + if (!npt_enabled)
> + __kvm_mmu_invalidate_addr(vcpu, &vcpu->arch.root_mmu, gva,
> + KVM_MMU_ROOTS_ALL, false);
>
> return kvm_skip_emulated_instruction(vcpu);
> }
> --
> 2.49.0.395.g12beb8f557-goog
>
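A hypothetical sketch of the canonical-address check being asked for, placed
before gva is consumed by the software invalidation; the exact helper and the
desired architectural behavior would need to be confirmed against the APM and
existing KVM checks:

	/*
	 * Hypothetical sketch only: skip the software invalidation if gva is
	 * non-canonical, since INVLPGA itself does not fault and there is
	 * nothing sane to invalidate.
	 */
	if (is_noncanonical_address(gva, vcpu))
		return kvm_skip_emulated_instruction(vcpu);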
On Wed, 2025-03-26 at 19:44 +0000, Yosry Ahmed wrote:
> Currently, INVPLGA interception handles it like INVLPG, which flushes
> L1's TLB translations for the address. It was implemented in this way
> because L1 and L2 shared an ASID. Now, L1 and L2 have separate ASIDs. It
> is still harmless to flush L1's translations, but it's only correct
> because all translations are flushed on nested transitions anyway.
>
> In preparation for stopping unconditional flushes on nested transitions,
> handle INVPLGA interception properly. If L1 specified zero as the ASID,
> this is equivalent to INVLPG, so handle it as such. Otherwise, use
> INVPLGA to flush the translations of the appropriate ASID tracked by
> KVM, if any. Sync the shadow MMU as well, as L1 invalidated L2's
> mappings.
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 5 +++--
> arch/x86/kvm/svm/svm.c | 36 +++++++++++++++++++++++++++++++--
> 3 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d881e7d276b12..a158d324168a0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2237,6 +2237,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> void *insn, int insn_len);
> void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
> void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> + u64 addr, unsigned long roots, bool gva_flush);
> void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> u64 addr, unsigned long roots);
> void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2b1994f12753..d3baa12df84e7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6355,8 +6355,8 @@ static void kvm_mmu_invalidate_addr_in_root(struct kvm_vcpu *vcpu,
> write_unlock(&vcpu->kvm->mmu_lock);
> }
>
> -static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> - u64 addr, unsigned long roots, bool gva_flush)
> +void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> + u64 addr, unsigned long roots, bool gva_flush)
> {
> int i;
>
> @@ -6382,6 +6382,7 @@ static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
> kvm_mmu_invalidate_addr_in_root(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
> }
> }
> +EXPORT_SYMBOL_GPL(__kvm_mmu_invalidate_addr);
>
> void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> u64 addr, unsigned long roots)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 3649707c61d3e..4b95fd6b501e6 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2505,6 +2505,7 @@ static int clgi_interception(struct kvm_vcpu *vcpu)
>
> static int invlpga_interception(struct kvm_vcpu *vcpu)
> {
> + struct vcpu_svm *svm = to_svm(vcpu);
> gva_t gva = kvm_rax_read(vcpu);
> u32 asid = kvm_rcx_read(vcpu);
>
> @@ -2514,8 +2515,39 @@ static int invlpga_interception(struct kvm_vcpu *vcpu)
>
> trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
>
> - /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
> - kvm_mmu_invlpg(vcpu, gva);
> + /*
> + * APM is silent about using INVLPGA to flush the host ASID (i.e. 0).
> + * Do the logical thing and handle it like INVLPG.
> + */
> + if (asid == 0) {
> + kvm_mmu_invlpg(vcpu, gva);
> + return kvm_skip_emulated_instruction(vcpu);
> + }
> +
> + /*
> + * Check if L1 specified the L2 ASID we are currently tracking. If it
> + * isn't, do nothing as we have to handle the TLB flush when switching
> + * to the new ASID anyway.
> + */
> + if (asid == svm->nested.last_asid)
> + invlpga(gva, svm_nested_asid(vcpu->kvm));
> +
> + /*
> + * If NPT is disabled, sync the shadow page tables as L1 is invalidating
> + * mappings for L2. Sync all roots as ASIDs are not tracked in the MMU
> + * role.
> + *
> + * As we are not flushing the current context, skip the gva flush from
> + * __kvm_mmu_invalidate_addr(), it would flush the wrong ASID anyway.
> + * The correct TLB flush was done above (if needed).
> + *
> + * This always operates on root_mmu because L1 and L2 share an MMU when
> + * NPT is disabled. This can be optimized by invalidating guest roots
> + * only.
> + */
> + if (!npt_enabled)
> + __kvm_mmu_invalidate_addr(vcpu, &vcpu->arch.root_mmu, gva,
> + KVM_MMU_ROOTS_ALL, false);
>
> return kvm_skip_emulated_instruction(vcpu);
> }
Looks good.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Best regards,
Maxim Levitsky
Now that nested TLB flushes are properly tracked, start allocating a
separate ASID for nested guests. This allows dropping the unconditional
TLB flushes on nested transitions and doing finer-grained TLB flushing
when necessary.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 11 +++++++++--
arch/x86/kvm/svm/svm.c | 5 +++--
arch/x86/kvm/svm/svm.h | 3 +++
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 544913461693c..0c887c91bd50d 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1204,6 +1204,7 @@ int svm_allocate_nested(struct vcpu_svm *svm)
{
struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
struct page *vmcb02_page;
+ unsigned int asid;
if (svm->nested.initialized)
return 0;
@@ -1221,8 +1222,14 @@ int svm_allocate_nested(struct vcpu_svm *svm)
svm->nested.initialized = true;
- if (!kvm_svm->nested_asid)
- kvm_svm->nested_asid = kvm_svm->asid;
+ if (!kvm_svm->nested_asid) {
+ asid = kvm_tlb_tags_alloc(&svm_asids);
+ if (asid && !svm_register_asid(asid)) {
+ kvm_tlb_tags_free(&svm_asids, asid);
+ asid = 0;
+ }
+ kvm_svm->nested_asid = asid ?: fallback_asid;
+ }
return 0;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 4b95fd6b501e6..196f5bca57a0e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -249,8 +249,8 @@ static unsigned long iopm_base;
DEFINE_PER_CPU(struct svm_cpu_data, svm_data);
-static struct kvm_tlb_tags svm_asids;
-static unsigned int fallback_asid;
+struct kvm_tlb_tags svm_asids;
+unsigned int fallback_asid;
/*
* Only MSR_TSC_AUX is switched via the user return hook. EFER is switched via
@@ -5127,6 +5127,7 @@ static void svm_vm_destroy(struct kvm *kvm)
avic_vm_destroy(kvm);
sev_vm_destroy(kvm);
kvm_tlb_tags_free(&svm_asids, kvm_svm->asid);
+ kvm_tlb_tags_free(&svm_asids, kvm_svm->nested_asid);
}
static int svm_vm_init(struct kvm *kvm)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0c44133bc05ca..220d10d2b1a5c 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -630,6 +630,9 @@ static inline void svm_vmgexit_no_action(struct vcpu_svm *svm, u64 data)
extern bool dump_invalid_vmcb;
+extern struct kvm_tlb_tags svm_asids;
+extern unsigned int fallback_asid;
+
u32 svm_msrpm_offset(u32 msr);
u32 *svm_vcpu_alloc_msrpm(void);
void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
--
2.49.0.395.g12beb8f557-goog
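For readers jumping into this patch, a hypothetical sketch of svm_nested_asid(),
which is introduced earlier in the series and consumes the nested_asid set up
above; its real definition may differ:

/* Hypothetical sketch; the real helper is defined earlier in the series. */
static inline unsigned int svm_nested_asid(struct kvm *kvm)
{
	/* The per-VM ASID used for L2, set up by svm_allocate_nested(). */
	return to_kvm_svm(kvm)->nested_asid;
}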
On Wed, 2025-03-26 at 19:44 +0000, Yosry Ahmed wrote:
> Now that nested TLB flushes are properly tracked, start allocating a
> separate ASID for nested guests. This allows dropping the unconditional
> TLB flushes on nested transitions and doing finer grained TLB flushing
> when necessary.
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/kvm/svm/nested.c | 11 +++++++++--
> arch/x86/kvm/svm/svm.c | 5 +++--
> arch/x86/kvm/svm/svm.h | 3 +++
> 3 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 544913461693c..0c887c91bd50d 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1204,6 +1204,7 @@ int svm_allocate_nested(struct vcpu_svm *svm)
> {
> struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
> struct page *vmcb02_page;
> + unsigned int asid;
>
> if (svm->nested.initialized)
> return 0;
> @@ -1221,8 +1222,14 @@ int svm_allocate_nested(struct vcpu_svm *svm)
>
> svm->nested.initialized = true;
>
> - if (!kvm_svm->nested_asid)
> - kvm_svm->nested_asid = kvm_svm->asid;
> + if (!kvm_svm->nested_asid) {
> + asid = kvm_tlb_tags_alloc(&svm_asids);
> + if (asid && !svm_register_asid(asid)) {
> + kvm_tlb_tags_free(&svm_asids, asid);
> + asid = 0;
> + }
> + kvm_svm->nested_asid = asid ?: fallback_asid;
> + }
Nitpick: AFAIK nested KVM, at least, doesn't enable EFER.SVME
unless it actually runs a guest, thus most of the time we will waste an ASID on a VM
which once ran a nested VM and hasn't run anything nested since.
So maybe we want to free the nested ASID in svm_free_nested()?
>
> return 0;
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 4b95fd6b501e6..196f5bca57a0e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -249,8 +249,8 @@ static unsigned long iopm_base;
>
> DEFINE_PER_CPU(struct svm_cpu_data, svm_data);
>
> -static struct kvm_tlb_tags svm_asids;
> -static unsigned int fallback_asid;
> +struct kvm_tlb_tags svm_asids;
> +unsigned int fallback_asid;
>
> /*
> * Only MSR_TSC_AUX is switched via the user return hook. EFER is switched via
> @@ -5127,6 +5127,7 @@ static void svm_vm_destroy(struct kvm *kvm)
> avic_vm_destroy(kvm);
> sev_vm_destroy(kvm);
> kvm_tlb_tags_free(&svm_asids, kvm_svm->asid);
> + kvm_tlb_tags_free(&svm_asids, kvm_svm->nested_asid);
> }
>
> static int svm_vm_init(struct kvm *kvm)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 0c44133bc05ca..220d10d2b1a5c 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -630,6 +630,9 @@ static inline void svm_vmgexit_no_action(struct vcpu_svm *svm, u64 data)
>
> extern bool dump_invalid_vmcb;
>
> +extern struct kvm_tlb_tags svm_asids;
> +extern unsigned int fallback_asid;
> +
> u32 svm_msrpm_offset(u32 msr);
> u32 *svm_vcpu_alloc_msrpm(void);
> void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
Best regards,
Maxim Levitsky
On Thu, Apr 03, 2025 at 04:11:47PM -0400, Maxim Levitsky wrote:
> On Wed, 2025-03-26 at 19:44 +0000, Yosry Ahmed wrote:
> > Now that nested TLB flushes are properly tracked, start allocating a
> > separate ASID for nested guests. This allows dropping the unconditional
> > TLB flushes on nested transitions and doing finer grained TLB flushing
> > when necessary.
> >
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> > arch/x86/kvm/svm/nested.c | 11 +++++++++--
> > arch/x86/kvm/svm/svm.c | 5 +++--
> > arch/x86/kvm/svm/svm.h | 3 +++
> > 3 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 544913461693c..0c887c91bd50d 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1204,6 +1204,7 @@ int svm_allocate_nested(struct vcpu_svm *svm)
> > {
> > struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
> > struct page *vmcb02_page;
> > + unsigned int asid;
> >
> > if (svm->nested.initialized)
> > return 0;
> > @@ -1221,8 +1222,14 @@ int svm_allocate_nested(struct vcpu_svm *svm)
> >
> > svm->nested.initialized = true;
> >
> > - if (!kvm_svm->nested_asid)
> > - kvm_svm->nested_asid = kvm_svm->asid;
> > + if (!kvm_svm->nested_asid) {
> > + asid = kvm_tlb_tags_alloc(&svm_asids);
> > + if (asid && !svm_register_asid(asid)) {
> > + kvm_tlb_tags_free(&svm_asids, asid);
> > + asid = 0;
> > + }
> > + kvm_svm->nested_asid = asid ?: fallback_asid;
> > + }
>
> Nitpick: AFAIK at least nested KVM doesn't enable EFER.SVME,
> unless it actually runs a guest thus most of the time we will waste a ASID on a VM
> which once did run a VM nested and since then doesn't run anything else.
Oh yeah, I missed that, thanks. Will do.
>
> So maybe we want to free the nested ASID in the svm_free_nested?
>
> >
> > return 0;
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 4b95fd6b501e6..196f5bca57a0e 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -249,8 +249,8 @@ static unsigned long iopm_base;
> >
> > DEFINE_PER_CPU(struct svm_cpu_data, svm_data);
> >
> > -static struct kvm_tlb_tags svm_asids;
> > -static unsigned int fallback_asid;
> > +struct kvm_tlb_tags svm_asids;
> > +unsigned int fallback_asid;
> >
> > /*
> > * Only MSR_TSC_AUX is switched via the user return hook. EFER is switched via
> > @@ -5127,6 +5127,7 @@ static void svm_vm_destroy(struct kvm *kvm)
> > avic_vm_destroy(kvm);
> > sev_vm_destroy(kvm);
> > kvm_tlb_tags_free(&svm_asids, kvm_svm->asid);
> > + kvm_tlb_tags_free(&svm_asids, kvm_svm->nested_asid);
> > }
> >
> > static int svm_vm_init(struct kvm *kvm)
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index 0c44133bc05ca..220d10d2b1a5c 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -630,6 +630,9 @@ static inline void svm_vmgexit_no_action(struct vcpu_svm *svm, u64 data)
> >
> > extern bool dump_invalid_vmcb;
> >
> > +extern struct kvm_tlb_tags svm_asids;
> > +extern unsigned int fallback_asid;
> > +
> > u32 svm_msrpm_offset(u32 msr);
> > u32 *svm_vcpu_alloc_msrpm(void);
> > void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
>
>
> Best regards,
> Maxim Levitsky
>
>
>
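A hypothetical sketch of what freeing the nested ASID along the lines of
Maxim's suggestion could look like; names are taken from the series, and a real
change would need to handle sharing across vCPUs as well as the fallback ASID:

/*
 * Hypothetical sketch of Maxim's suggestion, not an actual patch. The
 * nested ASID is per-VM while svm_free_nested() is per-vCPU, and the
 * fallback ASID must not be freed, so a real change needs more care.
 */
void svm_free_nested(struct vcpu_svm *svm)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);

	if (!svm->nested.initialized)
		return;

	/* ... existing vmcb02 and nested MSRPM freeing ... */

	if (kvm_svm->nested_asid && kvm_svm->nested_asid != fallback_asid) {
		kvm_tlb_tags_free(&svm_asids, kvm_svm->nested_asid);
		kvm_svm->nested_asid = 0;
	}

	svm->nested.initialized = false;
}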
Now that nested TLB flushes are properly tracked with a well-maintained
separate ASID for L2 and proper handling of L1's TLB flush requests,
drop the unconditional flushes and syncs on nested transitions.
On a Milan machine, an L1 guest and an L2 guest were booted, each with a single
vCPU, and pinned to a single physical CPU to maximize TLB collisions. In
this setup, the cpuid_rate microbenchmark [1] showed the following
changes with this patch:
+--------+--------+-------------------+----------------------+
| L0 | L1 | cpuid_rate (base) | cpuid_rate (patched) |
+========+========+===================+======================+
| NPT | NPT | 256621 | 301113 (+17.3%) |
| NPT | Shadow | 180017 | 203347 (+12.96%) |
| Shadow | Shadow | 177006 | 189150 (+6.86%) |
+--------+--------+-------------------+----------------------+
[1] https://lore.kernel.org/kvm/20231109180646.2963718-1-khorenko@virtuozzo.com/
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
arch/x86/kvm/svm/nested.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 0c887c91bd50d..1cc281a7b666c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -511,9 +511,6 @@ static void nested_svm_entry_tlb_flush(struct kvm_vcpu *vcpu)
svm->nested.last_asid = svm->nested.ctl.asid;
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}
- /* TODO: optimize unconditional TLB flush/MMU sync */
- kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
- kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
}
/* See nested_svm_entry_tlb_flush() */
@@ -525,9 +522,6 @@ static void nested_svm_exit_tlb_flush(struct kvm_vcpu *vcpu)
if (svm->nested.ctl.tlb_ctl == TLB_CONTROL_FLUSH_ALL_ASID)
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
-
- kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
- kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
}
/*
--
2.49.0.395.g12beb8f557-goog