[v5] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock

[PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by James Houghton 3 months ago

From: Vipin Sharma <vipinsh@google.com>

Use MMU read lock to recover TDP MMU NX huge pages. Iterate
over the huge pages list under tdp_mmu_pages_lock protection and
unaccount the page before dropping the lock.

We must not zap an SPTE if:
- The SPTE is a root page.
- The SPTE does not point at the SP's page table.

If the SPTE does not point at the SP's page table, then something else
has change the SPTE, so we cannot safely zap it.

Warn if zapping SPTE fails and current SPTE is still pointing to same
page table. This should never happen.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page. Unaccounting sooner during the list traversal is increasing
the window of that race. Functionally, it is okay, because accounting
doesn't protect against iTLB multi-hit bug, it is there purely to
prevent KVM from bouncing a gfn between two page sizes. The only
downside is that a vCPU will end up doing more work in tearing down all
the child SPTEs. This should be a very rare race.

Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
read lock. This optimizaion is done to solve a guest jitter issue on
Windows VM which was observing an increase in network latency.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
 arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
 2 files changed, 105 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b074f7bb5cc58..7df1b4ead705b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
 	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
 }
 
+static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
+					     struct kvm_mmu_page *sp)
+{
+	struct kvm_memory_slot *slot = NULL;
+
+	/*
+	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
+	 * it the test cannot possibly return true.  On the other hand, if any
+	 * memslot has logging enabled, chances are good that all of them do, in
+	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
+	 * page.
+	 *
+	 * If a memslot update is in progress, reading an incorrect value of
+	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
+	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
+	 * nonzero, the page will be zapped unnecessarily.  Either way, this only
+	 * affects efficiency in racy situations, and not correctness.
+	 */
+	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
+		struct kvm_memslots *slots;
+
+		slots = kvm_memslots_for_spte_role(kvm, sp->role);
+		slot = __gfn_to_memslot(slots, sp->gfn);
+		WARN_ON_ONCE(!slot);
+	}
+	return slot && kvm_slot_dirty_track_enabled(slot);
+}
+
 static void kvm_recover_nx_huge_pages(struct kvm *kvm,
-				      enum kvm_mmu_type mmu_type)
+				      const enum kvm_mmu_type mmu_type)
 {
 	unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
+	bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
 	struct list_head *nx_huge_pages;
-	struct kvm_memory_slot *slot;
 	struct kvm_mmu_page *sp;
 	LIST_HEAD(invalid_list);
 	bool flush = false;
@@ -7549,7 +7577,10 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 	nx_huge_pages = &kvm->arch.possible_nx_huge_pages[mmu_type].pages;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
-	write_lock(&kvm->mmu_lock);
+	if (is_tdp_mmu)
+		read_lock(&kvm->mmu_lock);
+	else
+		write_lock(&kvm->mmu_lock);
 
 	/*
 	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
@@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 	rcu_read_lock();
 
 	for ( ; to_zap; --to_zap) {
-		if (list_empty(nx_huge_pages))
+#ifdef CONFIG_X86_64
+		if (is_tdp_mmu)
+			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+#endif
+		if (list_empty(nx_huge_pages)) {
+#ifdef CONFIG_X86_64
+			if (is_tdp_mmu)
+				spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+#endif
 			break;
+		}
 
 		/*
 		 * We use a separate list instead of just using active_mmu_pages
@@ -7575,50 +7615,40 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
 		WARN_ON_ONCE(!sp->role.direct);
 
+		unaccount_nx_huge_page(kvm, sp);
+
+#ifdef CONFIG_X86_64
+		if (is_tdp_mmu)
+			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+#endif
+
 		/*
-		 * Unaccount and do not attempt to recover any NX Huge Pages
-		 * that are being dirty tracked, as they would just be faulted
-		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
-		 * recovered, along with all the other huge pages in the slot,
-		 * when dirty logging is disabled.
-		 *
-		 * Since gfn_to_memslot() is relatively expensive, it helps to
-		 * skip it if it the test cannot possibly return true.  On the
-		 * other hand, if any memslot has logging enabled, chances are
-		 * good that all of them do, in which case unaccount_nx_huge_page()
-		 * is much cheaper than zapping the page.
-		 *
-		 * If a memslot update is in progress, reading an incorrect value
-		 * of kvm->nr_memslots_dirty_logging is not a problem: if it is
-		 * becoming zero, gfn_to_memslot() will be done unnecessarily; if
-		 * it is becoming nonzero, the page will be zapped unnecessarily.
-		 * Either way, this only affects efficiency in racy situations,
-		 * and not correctness.
+		 * Do not attempt to recover any NX Huge Pages that are being
+		 * dirty tracked, as they would just be faulted back in as 4KiB
+		 * pages. The NX Huge Pages in this slot will be recovered,
+		 * along with all the other huge pages in the slot, when dirty
+		 * logging is disabled.
 		 */
-		slot = NULL;
-		if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
-			struct kvm_memslots *slots;
+		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
+			if (is_tdp_mmu)
+				flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
+			else
+				kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 
-			slots = kvm_memslots_for_spte_role(kvm, sp->role);
-			slot = __gfn_to_memslot(slots, sp->gfn);
-			WARN_ON_ONCE(!slot);
 		}
 
-		if (slot && kvm_slot_dirty_track_enabled(slot))
-			unaccount_nx_huge_page(kvm, sp);
-		else if (mmu_type == KVM_TDP_MMU)
-			flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
-		else
-			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 			rcu_read_unlock();
 
-			cond_resched_rwlock_write(&kvm->mmu_lock);
-			flush = false;
+			if (is_tdp_mmu)
+				cond_resched_rwlock_read(&kvm->mmu_lock);
+			else
+				cond_resched_rwlock_write(&kvm->mmu_lock);
 
+			flush = false;
 			rcu_read_lock();
 		}
 	}
@@ -7626,7 +7656,10 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 
 	rcu_read_unlock();
 
-	write_unlock(&kvm->mmu_lock);
+	if (is_tdp_mmu)
+		read_unlock(&kvm->mmu_lock);
+	else
+		write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 19907eb04a9c4..31d921705dee7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -928,21 +928,49 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
 					   struct kvm_mmu_page *sp)
 {
-	u64 old_spte;
+	struct tdp_iter iter = {
+		.old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
+		.sptep = sp->ptep,
+		.level = sp->role.level + 1,
+		.gfn = sp->gfn,
+		.as_id = kvm_mmu_page_as_id(sp),
+	};
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
+		return false;
 
 	/*
-	 * This helper intentionally doesn't allow zapping a root shadow page,
-	 * which doesn't have a parent page table and thus no associated entry.
+	 * Root shadow pages don't have a parent page table and thus no
+	 * associated entry, but they can never be possible NX huge pages.
 	 */
 	if (WARN_ON_ONCE(!sp->ptep))
 		return false;
 
-	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
-	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
+	/*
+	 * Since mmu_lock is held in read mode, it's possible another task has
+	 * already modified the SPTE. Zap the SPTE if and only if the SPTE
+	 * points at the SP's page table, as checking shadow-present isn't
+	 * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
+	 * another SP. Note, spte_to_child_pt() also checks that the SPTE is
+	 * shadow-present, i.e. guards against zapping a frozen SPTE.
+	 */
+	if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
 		return false;
 
-	tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
-			 SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
+	/*
+	 * If a different task modified the SPTE, then it should be impossible
+	 * for the SPTE to still be used for the to-be-zapped SP. Non-leaf
+	 * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
+	 * creating non-leaf SPTEs, and all other bits are immutable for non-
+	 * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
+	 * zapping and replacement.
+	 */
+	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
+		WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
+		return false;
+	}
 
 	return true;
 }
-- 
2.50.0.727.gbf7dc18ff4-goog

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by Sean Christopherson 2 months, 2 weeks ago

On Mon, Jul 07, 2025, James Houghton wrote:
> From: Vipin Sharma <vipinsh@google.com>
> 
> Use MMU read lock to recover TDP MMU NX huge pages. Iterate

Wrap at ~75 chars.

> over the huge pages list under tdp_mmu_pages_lock protection and
> unaccount the page before dropping the lock.
> 
> We must not zap an SPTE if:

No pronouns!

> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
> 
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so we cannot safely zap it.
> 
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table. This should never happen.
> 
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page. Unaccounting sooner during the list traversal is increasing
> the window of that race. Functionally, it is okay, because accounting
> doesn't protect against iTLB multi-hit bug, it is there purely to
> prevent KVM from bouncing a gfn between two page sizes. The only
> downside is that a vCPU will end up doing more work in tearing down all
> the child SPTEs. This should be a very rare race.
> 
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock. This optimizaion is done to solve a guest jitter issue on
> Windows VM which was observing an increase in network latency.

With slight tweaking:

Use MMU read lock to recover TDP MMU NX huge pages.  To prevent
concurrent modification of the list of potential huge pages, iterate over
the list under tdp_mmu_pages_lock protection and unaccount the page
before dropping the lock.

Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
read lock, which solves a guest jitter issue on Windows VMs which were
observing an increase in network latency.

Do not zap an SPTE if:
- The SPTE is a root page.
- The SPTE does not point at the SP's page table.

If the SPTE does not point at the SP's page table, then something else
has change the SPTE, so KVM cannot safely zap it.

Warn if zapping SPTE fails and current SPTE is still pointing to same
page table, as it should be impossible for the CMPXCHG to fail due to all
other write scenarios being mutually exclusive.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page.  Unaccounting sooner during the list traversal increases the
window of that race, but functionally, it is okay.  Accounting doesn't
protect against iTLB multi-hit bug, it is there purely to prevent KVM
from bouncing a gfn between two page sizes. The only  downside is that a
vCPU will end up doing more work in tearing down all  the child SPTEs.
This should be a very rare race.

> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> Co-developed-by: James Houghton <jthoughton@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
>  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
>  2 files changed, 105 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b074f7bb5cc58..7df1b4ead705b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
>  	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
>  }
>  
> +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> +					     struct kvm_mmu_page *sp)
> +{
> +	struct kvm_memory_slot *slot = NULL;
> +
> +	/*
> +	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> +	 * it the test cannot possibly return true.  On the other hand, if any
> +	 * memslot has logging enabled, chances are good that all of them do, in
> +	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
> +	 * page.

And largely irrelevant, because KVM should unaccount the NX no matter what.  I
kinda get what you're saying, but honestly it adds a lot of confusion, especially
since unaccount_nx_huge_page() is in the caller.

> +	 *
> +	 * If a memslot update is in progress, reading an incorrect value of
> +	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> +	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> +	 * nonzero, the page will be zapped unnecessarily.  Either way, this only
> +	 * affects efficiency in racy situations, and not correctness.
> +	 */
> +	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {

Short-circuit the function to decrease indentation, and so that "slot" doesn't
need to be NULL-initialized.

> +		struct kvm_memslots *slots;
> +
> +		slots = kvm_memslots_for_spte_role(kvm, sp->role);
> +		slot = __gfn_to_memslot(slots, sp->gfn);

Then this can be:

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);

without creating a stupid-long line.

> +		WARN_ON_ONCE(!slot);

And then:

	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);

With a comment cleanup:

	struct kvm_memory_slot *slot;

	/*
	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
	 * as memslot lookups are relatively expensive.
	 *
	 * If a memslot update is in progress, reading an incorrect value of
	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
	 * zero, KVM will  do an unnecessary memslot lookup;  if it is becoming
	 * nonzero, the page will be zapped unnecessarily.  Either way, this
	 * only affects efficiency in racy situations, and not correctness.
	 */
	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
		return false;

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);
> @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>  	rcu_read_lock();
>  
>  	for ( ; to_zap; --to_zap) {
> -		if (list_empty(nx_huge_pages))
> +#ifdef CONFIG_X86_64

These #ifdefs still make me sad, but I also still think they're the least awful
solution.  And hopefully we will jettison 32-bit sooner than later :-)

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by James Houghton 2 months, 1 week ago

On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: Vipin Sharma <vipinsh@google.com>
> >
> > Use MMU read lock to recover TDP MMU NX huge pages. Iterate
>
> Wrap at ~75 chars.

Oh, I did indeed wrap the text pretty aggressively for this patch. Not
sure why that happened.

>
> > over the huge pages list under tdp_mmu_pages_lock protection and
> > unaccount the page before dropping the lock.
> >
> > We must not zap an SPTE if:
>
> No pronouns!

Right.

>
> > - The SPTE is a root page.
> > - The SPTE does not point at the SP's page table.
> >
> > If the SPTE does not point at the SP's page table, then something else
> > has change the SPTE, so we cannot safely zap it.
> >
> > Warn if zapping SPTE fails and current SPTE is still pointing to same
> > page table. This should never happen.
> >
> > There is always a race between dirty logging, vCPU faults, and NX huge
> > page recovery for backing a gfn by an NX huge page or an executable
> > small page. Unaccounting sooner during the list traversal is increasing
> > the window of that race. Functionally, it is okay, because accounting
> > doesn't protect against iTLB multi-hit bug, it is there purely to
> > prevent KVM from bouncing a gfn between two page sizes. The only
> > downside is that a vCPU will end up doing more work in tearing down all
> > the child SPTEs. This should be a very rare race.
> >
> > Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> > read lock. This optimizaion is done to solve a guest jitter issue on
> > Windows VM which was observing an increase in network latency.
>
> With slight tweaking:
>
> Use MMU read lock to recover TDP MMU NX huge pages.  To prevent
> concurrent modification of the list of potential huge pages, iterate over
> the list under tdp_mmu_pages_lock protection and unaccount the page
> before dropping the lock.
>
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock, which solves a guest jitter issue on Windows VMs which were
> observing an increase in network latency.
>
> Do not zap an SPTE if:
> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
>
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so KVM cannot safely zap it.

"has changed" (my mistake)

>
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table, as it should be impossible for the CMPXCHG to fail due to all
> other write scenarios being mutually exclusive.
>
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page.  Unaccounting sooner during the list traversal increases the
> window of that race, but functionally, it is okay.  Accounting doesn't
> protect against iTLB multi-hit bug, it is there purely to prevent KVM
> from bouncing a gfn between two page sizes. The only  downside is that a
> vCPU will end up doing more work in tearing down all  the child SPTEs.
> This should be a very rare race.

Thanks, this is much better.

>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > Co-developed-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
> >  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
> >  2 files changed, 105 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b074f7bb5cc58..7df1b4ead705b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
> >       return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
> >  }
> >
> > +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > +                                          struct kvm_mmu_page *sp)
> > +{
> > +     struct kvm_memory_slot *slot = NULL;
> > +
> > +     /*
> > +      * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> > +      * it the test cannot possibly return true.  On the other hand, if any
> > +      * memslot has logging enabled, chances are good that all of them do, in
> > +      * which case unaccount_nx_huge_page() is much cheaper than zapping the
> > +      * page.
>
> And largely irrelevant, because KVM should unaccount the NX no matter what.  I
> kinda get what you're saying, but honestly it adds a lot of confusion, especially
> since unaccount_nx_huge_page() is in the caller.
>
> > +      *
> > +      * If a memslot update is in progress, reading an incorrect value of
> > +      * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> > +      * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> > +      * nonzero, the page will be zapped unnecessarily.  Either way, this only
> > +      * affects efficiency in racy situations, and not correctness.
> > +      */
> > +     if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
>
> Short-circuit the function to decrease indentation, and so that "slot" doesn't
> need to be NULL-initialized.
>
> > +             struct kvm_memslots *slots;
> > +
> > +             slots = kvm_memslots_for_spte_role(kvm, sp->role);
> > +             slot = __gfn_to_memslot(slots, sp->gfn);
>
> Then this can be:
>
>         slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>
> without creating a stupid-long line.
>
> > +             WARN_ON_ONCE(!slot);
>
> And then:
>
>         if (WARN_ON_ONCE(!slot))
>                 return false;
>
>         return kvm_slot_dirty_track_enabled(slot);
>
> With a comment cleanup:
>
>         struct kvm_memory_slot *slot;
>
>         /*
>          * Skip the memslot lookup if dirty tracking can't possibly be enabled,
>          * as memslot lookups are relatively expensive.
>          *
>          * If a memslot update is in progress, reading an incorrect value of
>          * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
>          * zero, KVM will  do an unnecessary memslot lookup;  if it is becoming
>          * nonzero, the page will be zapped unnecessarily.  Either way, this
>          * only affects efficiency in racy situations, and not correctness.
>          */
>         if (!atomic_read(&kvm->nr_memslots_dirty_logging))
>                 return false;
>
>         slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>         if (WARN_ON_ONCE(!slot))
>                 return false;
>
>         return kvm_slot_dirty_track_enabled(slot);

LGTM, thanks!

> > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >       rcu_read_lock();
> >
> >       for ( ; to_zap; --to_zap) {
> > -             if (list_empty(nx_huge_pages))
> > +#ifdef CONFIG_X86_64
>
> These #ifdefs still make me sad, but I also still think they're the least awful
> solution.  And hopefully we will jettison 32-bit sooner than later :-)

Yeah I couldn't come up with anything better. :(

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by David Matlack 2 months, 1 week ago

On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > >       rcu_read_lock();
> > >
> > >       for ( ; to_zap; --to_zap) {
> > > -             if (list_empty(nx_huge_pages))
> > > +#ifdef CONFIG_X86_64
> >
> > These #ifdefs still make me sad, but I also still think they're the least awful
> > solution.  And hopefully we will jettison 32-bit sooner than later :-)
>
> Yeah I couldn't come up with anything better. :(

Could we just move the definition of tdp_mmu_pages_lock outside of
CONFIG_X86_64? The only downside I can think of is slightly larger kvm
structs for 32-bit builds.

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by Sean Christopherson 2 months, 1 week ago

On Mon, Jul 28, 2025, David Matlack wrote:
> On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> > On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > >       rcu_read_lock();
> > > >
> > > >       for ( ; to_zap; --to_zap) {
> > > > -             if (list_empty(nx_huge_pages))
> > > > +#ifdef CONFIG_X86_64
> > >
> > > These #ifdefs still make me sad, but I also still think they're the least awful
> > > solution.  And hopefully we will jettison 32-bit sooner than later :-)
> >
> > Yeah I couldn't come up with anything better. :(
> 
> Could we just move the definition of tdp_mmu_pages_lock outside of
> CONFIG_X86_64? The only downside I can think of is slightly larger kvm
> structs for 32-bit builds.

Hmm, I was going to say "no, because we'd also need to do spin_lock_init()", but
obviously spin_(un)lock() will only ever be invoked for 64-bit kernels.  I still
don't love the idea of making tdp_mmu_pages_lock visible outside of CONFIG_X86_64,
it feels like we're just asking to introduce (likely benign) bugs.

Ugh, and I just noticed this as well:

  #ifndef CONFIG_X86_64
  #define KVM_TDP_MMU -1
  #endif

Rather than expose kvm->arch.tdp_mmu_pages_lock, what about using a single #ifdef
section to bury both is_tdp_mmu and a local kvm->arch.tdp_mmu_pages_lock pointer?

Alternatively, we could do:

	const bool is_tdp_mmu = IS_ENABLED(CONFIG_X86_64) && mmu_type != KVM_SHADOW_MMU;

to avoid referencing KVM_TDP_MMU, but that's quite ugly.  Overall, I think the
below strikes the best balance between polluting the code with #ifdefs, and
generating robust code.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 52bf6a886bfd..c038d7cd187d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1372,10 +1372,6 @@ enum kvm_mmu_type {
        KVM_NR_MMU_TYPES,
 };
 
-#ifndef CONFIG_X86_64
-#define KVM_TDP_MMU -1
-#endif
-
 struct kvm_arch {
        unsigned long n_used_mmu_pages;
        unsigned long n_requested_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a6a1fb42b2d1..e2bde6a5e346 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
 static void kvm_recover_nx_huge_pages(struct kvm *kvm,
                                      const enum kvm_mmu_type mmu_type)
 {
+#ifdef CONFIG_X86_64
+       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
+       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
+#else
+       const bool is_tdp_mmu = false;
+       spinlock_t *tdp_mmu_pages_lock = NULL;
+#endif
        unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
-       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
        struct list_head *nx_huge_pages;
        struct kvm_mmu_page *sp;
        LIST_HEAD(invalid_list);
@@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
        rcu_read_lock();
 
        for ( ; to_zap; --to_zap) {
-#ifdef CONFIG_X86_64
                if (is_tdp_mmu)
-                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-#endif
+                       spin_lock(tdp_mmu_pages_lock);
+
                if (list_empty(nx_huge_pages)) {
-#ifdef CONFIG_X86_64
                        if (is_tdp_mmu)
-                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-#endif
+                               spin_unlock(tdp_mmu_pages_lock);
                        break;
                }
 
@@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
 
                unaccount_nx_huge_page(kvm, sp);
 
-#ifdef CONFIG_X86_64
                if (is_tdp_mmu)
-                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-#endif
+                       spin_unlock(tdp_mmu_pages_lock);
 
                /*
                 * Do not attempt to recover any NX Huge Pages that are being
--

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by James Houghton 2 months, 1 week ago

On Mon, Jul 28, 2025 at 2:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jul 28, 2025, David Matlack wrote:
> > On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> > > On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > > >       rcu_read_lock();
> > > > >
> > > > >       for ( ; to_zap; --to_zap) {
> > > > > -             if (list_empty(nx_huge_pages))
> > > > > +#ifdef CONFIG_X86_64
> > > >
> > > > These #ifdefs still make me sad, but I also still think they're the least awful
> > > > solution.  And hopefully we will jettison 32-bit sooner than later :-)
> > >
> > > Yeah I couldn't come up with anything better. :(
> >
> > Could we just move the definition of tdp_mmu_pages_lock outside of
> > CONFIG_X86_64? The only downside I can think of is slightly larger kvm
> > structs for 32-bit builds.
>
> Hmm, I was going to say "no, because we'd also need to do spin_lock_init()", but
> obviously spin_(un)lock() will only ever be invoked for 64-bit kernels.  I still
> don't love the idea of making tdp_mmu_pages_lock visible outside of CONFIG_X86_64,
> it feels like we're just asking to introduce (likely benign) bugs.
>
> Ugh, and I just noticed this as well:
>
>   #ifndef CONFIG_X86_64
>   #define KVM_TDP_MMU -1
>   #endif
>
> Rather than expose kvm->arch.tdp_mmu_pages_lock, what about using a single #ifdef
> section to bury both is_tdp_mmu and a local kvm->arch.tdp_mmu_pages_lock pointer?

SGTM.

>
> Alternatively, we could do:
>
>         const bool is_tdp_mmu = IS_ENABLED(CONFIG_X86_64) && mmu_type != KVM_SHADOW_MMU;

I tried something like this before and it didn't work; my compiler
still complained. Maybe I didn't do it quite right...

>
> to avoid referencing KVM_TDP_MMU, but that's quite ugly.  Overall, I think the
> below strikes the best balance between polluting the code with #ifdefs, and
> generating robust code.
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 52bf6a886bfd..c038d7cd187d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1372,10 +1372,6 @@ enum kvm_mmu_type {
>         KVM_NR_MMU_TYPES,
>  };
>
> -#ifndef CONFIG_X86_64
> -#define KVM_TDP_MMU -1
> -#endif
> -
>  struct kvm_arch {
>         unsigned long n_used_mmu_pages;
>         unsigned long n_requested_mmu_pages;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a6a1fb42b2d1..e2bde6a5e346 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
>  static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>                                       const enum kvm_mmu_type mmu_type)
>  {
> +#ifdef CONFIG_X86_64
> +       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> +       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
> +#else
> +       const bool is_tdp_mmu = false;
> +       spinlock_t *tdp_mmu_pages_lock = NULL;
> +#endif
>         unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
> -       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
>         struct list_head *nx_huge_pages;
>         struct kvm_mmu_page *sp;
>         LIST_HEAD(invalid_list);
> @@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>         rcu_read_lock();
>
>         for ( ; to_zap; --to_zap) {
> -#ifdef CONFIG_X86_64
>                 if (is_tdp_mmu)
> -                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -#endif
> +                       spin_lock(tdp_mmu_pages_lock);
> +
>                 if (list_empty(nx_huge_pages)) {
> -#ifdef CONFIG_X86_64
>                         if (is_tdp_mmu)
> -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -#endif
> +                               spin_unlock(tdp_mmu_pages_lock);
>                         break;
>                 }
>
> @@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>
>                 unaccount_nx_huge_page(kvm, sp);
>
> -#ifdef CONFIG_X86_64
>                 if (is_tdp_mmu)
> -                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -#endif
> +                       spin_unlock(tdp_mmu_pages_lock);
>
>                 /*
>                  * Do not attempt to recover any NX Huge Pages that are being
> --

LGTM! Thanks Sean.

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by David Matlack 2 months, 1 week ago

On Mon, Jul 28, 2025 at 2:49 PM James Houghton <jthoughton@google.com> wrote:
>
> On Mon, Jul 28, 2025 at 2:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Jul 28, 2025, David Matlack wrote:
> > > On Mon, Jul 28, 2025 at 11:08 AM James Houghton <jthoughton@google.com> wrote:
> > > > On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > > > >       rcu_read_lock();
> > > > > >
> > > > > >       for ( ; to_zap; --to_zap) {
> > > > > > -             if (list_empty(nx_huge_pages))
> > > > > > +#ifdef CONFIG_X86_64
> > > > >
> > > > > These #ifdefs still make me sad, but I also still think they're the least awful
> > > > > solution.  And hopefully we will jettison 32-bit sooner than later :-)
> > > >
> > > > Yeah I couldn't come up with anything better. :(
> > >
> > > Could we just move the definition of tdp_mmu_pages_lock outside of
> > > CONFIG_X86_64? The only downside I can think of is slightly larger kvm
> > > structs for 32-bit builds.
> >
> > Hmm, I was going to say "no, because we'd also need to do spin_lock_init()", but
> > obviously spin_(un)lock() will only ever be invoked for 64-bit kernels.  I still
> > don't love the idea of making tdp_mmu_pages_lock visible outside of CONFIG_X86_64,
> > it feels like we're just asking to introduce (likely benign) bugs.
> >
> > Ugh, and I just noticed this as well:
> >
> >   #ifndef CONFIG_X86_64
> >   #define KVM_TDP_MMU -1
> >   #endif
> >
> > Rather than expose kvm->arch.tdp_mmu_pages_lock, what about using a single #ifdef
> > section to bury both is_tdp_mmu and a local kvm->arch.tdp_mmu_pages_lock pointer?
>
> SGTM.
>
> >
> > Alternatively, we could do:
> >
> >         const bool is_tdp_mmu = IS_ENABLED(CONFIG_X86_64) && mmu_type != KVM_SHADOW_MMU;
>
> I tried something like this before and it didn't work; my compiler
> still complained. Maybe I didn't do it quite right...
>
> >
> > to avoid referencing KVM_TDP_MMU, but that's quite ugly.  Overall, I think the
> > below strikes the best balance between polluting the code with #ifdefs, and
> > generating robust code.
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 52bf6a886bfd..c038d7cd187d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1372,10 +1372,6 @@ enum kvm_mmu_type {
> >         KVM_NR_MMU_TYPES,
> >  };
> >
> > -#ifndef CONFIG_X86_64
> > -#define KVM_TDP_MMU -1
> > -#endif
> > -
> >  struct kvm_arch {
> >         unsigned long n_used_mmu_pages;
> >         unsigned long n_requested_mmu_pages;
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a6a1fb42b2d1..e2bde6a5e346 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> >  static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >                                       const enum kvm_mmu_type mmu_type)
> >  {
> > +#ifdef CONFIG_X86_64
> > +       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> > +       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
> > +#else
> > +       const bool is_tdp_mmu = false;
> > +       spinlock_t *tdp_mmu_pages_lock = NULL;
> > +#endif
> >         unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
> > -       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> >         struct list_head *nx_huge_pages;
> >         struct kvm_mmu_page *sp;
> >         LIST_HEAD(invalid_list);
> > @@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >         rcu_read_lock();
> >
> >         for ( ; to_zap; --to_zap) {
> > -#ifdef CONFIG_X86_64
> >                 if (is_tdp_mmu)
> > -                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > -#endif
> > +                       spin_lock(tdp_mmu_pages_lock);
> > +
> >                 if (list_empty(nx_huge_pages)) {
> > -#ifdef CONFIG_X86_64
> >                         if (is_tdp_mmu)
> > -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > -#endif
> > +                               spin_unlock(tdp_mmu_pages_lock);
> >                         break;
> >                 }
> >
> > @@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >
> >                 unaccount_nx_huge_page(kvm, sp);
> >
> > -#ifdef CONFIG_X86_64
> >                 if (is_tdp_mmu)
> > -                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > -#endif
> > +                       spin_unlock(tdp_mmu_pages_lock);
> >
> >                 /*
> >                  * Do not attempt to recover any NX Huge Pages that are being
> > --
>
> LGTM! Thanks Sean.

Is the compiler not smart enough to optimize out
kvm->arch.tdp_mmu_pages_lock? (To avoid needing the extra local
variable?) I thought there was some other KVM code that relied on
similar optimizations but I would have to go dig them up to remember.

Either way, this LGTM!

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by Sean Christopherson 2 months ago

On Fri, Aug 01, 2025, David Matlack wrote:
> On Mon, Jul 28, 2025 at 2:49 PM James Houghton <jthoughton@google.com> wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index a6a1fb42b2d1..e2bde6a5e346 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > >  static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > >                                       const enum kvm_mmu_type mmu_type)
> > >  {
> > > +#ifdef CONFIG_X86_64
> > > +       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> > > +       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
> > > +#else
> > > +       const bool is_tdp_mmu = false;
> > > +       spinlock_t *tdp_mmu_pages_lock = NULL;
> > > +#endif
> > >         unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
> > > -       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> > >         struct list_head *nx_huge_pages;
> > >         struct kvm_mmu_page *sp;
> > >         LIST_HEAD(invalid_list);
> > > @@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > >         rcu_read_lock();
> > >
> > >         for ( ; to_zap; --to_zap) {
> > > -#ifdef CONFIG_X86_64
> > >                 if (is_tdp_mmu)
> > > -                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > > -#endif
> > > +                       spin_lock(tdp_mmu_pages_lock);
> > > +
> > >                 if (list_empty(nx_huge_pages)) {
> > > -#ifdef CONFIG_X86_64
> > >                         if (is_tdp_mmu)
> > > -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > -#endif
> > > +                               spin_unlock(tdp_mmu_pages_lock);
> > >                         break;
> > >                 }
> > >
> > > @@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > >
> > >                 unaccount_nx_huge_page(kvm, sp);
> > >
> > > -#ifdef CONFIG_X86_64
> > >                 if (is_tdp_mmu)
> > > -                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > -#endif
> > > +                       spin_unlock(tdp_mmu_pages_lock);
> > >
> > >                 /*
> > >                  * Do not attempt to recover any NX Huge Pages that are being
> > > --
> >
> > LGTM! Thanks Sean.
> 
> Is the compiler not smart enough to optimize out kvm->arch.tdp_mmu_pages_lock?

Yes, the compiler will eliminate dead code at most optimization levels.  But that
optimization phase happens after initial compilation, i.e. the compiler needs to
generate the (probably intermediate?) code before it can trim away paths that are
unreachable.

> (To avoid needing the extra local variable?) I thought there was some other
> KVM code that relied on similar optimizations but I would have to go dig them
> up to remember.

KVM, and the kernel, absolutely relies on dead code elimination.  KVM most blatantly
uses the technique to avoid _defining_ stubs for code that is guarded by a Kconfig,
e.g. all of these functions are defined in sev.c (guarded by CONFIG_KVM_AMD_SEV),
but callers are guarded only with sev_guest() or sev_es_guest(), not with explicit
#idefs.

There are no build errors because the function calls aren't fully resolved until
link time (when svm.o is linked into kvm-amd.o).  But KVM still needs to _declare_
the functions, otherwise the compiler would fail during its initial code generation.

  int pre_sev_run(struct vcpu_svm *svm, int cpu);
  void sev_init_vmcb(struct vcpu_svm *svm);
  void sev_vcpu_after_set_cpuid(struct vcpu_svm *svm);
  int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in);
  void sev_es_vcpu_reset(struct vcpu_svm *svm);
  void sev_es_recalc_msr_intercepts(struct kvm_vcpu *vcpu);
  void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
  void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa);
  void sev_es_unmap_ghcb(struct vcpu_svm *svm);

Other notable "users" of dead code elimination are the BUILD_BUG_ON() family of
compile-time asserts.  So long as the condition can be resolved to a constant
false value during compile time, the "call" to __compiletime_error() will be
elided and everyone is happy.

  #ifdef __OPTIMIZE__
  /*
   * #ifdef __OPTIMIZE__ is only a good approximation; for instance "make
   * CFLAGS_foo.o=-Og" defines __OPTIMIZE__, does not elide the conditional code
   * and can break compilation with wrong error message(s). Combine with
   * -U__OPTIMIZE__ when needed.
   */
  # define __compiletime_assert(condition, msg, prefix, suffix)		\
	do {								\
		/*							\
		 * __noreturn is needed to give the compiler enough	\
		 * information to avoid certain possibly-uninitialized	\
		 * warnings (regardless of the build failing).		\
		 */							\
		__noreturn extern void prefix ## suffix(void)		\
			__compiletime_error(msg);			\
		if (!(condition))					\
			prefix ## suffix();				\
	} while (0)
  #else
  # define __compiletime_assert(condition, msg, prefix, suffix) ((void)(condition))
  #endif

Note, static_assert() is different in that it's a true assertion that's resolved
early on during compilation.

 * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
 * scope, but requires the expression to be an integer constant
 * expression (i.e., it is not enough that __builtin_constant_p() is
 * true for expr).

From a previous thread related to asserts (https://lore.kernel.org/all/aFGY0KVUksf1a6xB@google.com):

 : The advantage of BUILD_BUG_ON() is that it works so long as the condition is
 : compile-time constant, whereas static_assert() requires the condition to an
 : integer constant expression.  E.g. BUILD_BUG_ON() can be used so long as the
 : condition is eventually resolved to a constant, whereas static_assert() has
 : stricter requirements.
 : 
 : E.g. the fls64() assert below is fully resolved at compile time, but isn't a
 : purely constant expression, i.e. that one *needs* to be BUILD_BUG_ON().
 : 
 : --
 : arch/x86/kvm/svm/avic.c: In function ‘avic_init_backing_page’:
 : arch/x86/kvm/svm/avic.c:293:45: error: expression in static assertion is not constant
 :   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
 : include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
 :    78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
 :       |                                                        ^~~~
 : arch/x86/kvm/svm/avic.c:293:9: note: in expansion of macro ‘static_assert’
 :   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
 :       |         ^~~~~~~~~~~~~
 : make[5]: *** [scripts/Makefile.build:203: arch/x86/kvm/svm/avic.o] Error 1
 : --
 : 
 : The downside of BUILD_BUG_ON() is that it can't be used at global scope, i.e.
 : needs to be called from a function.
 : 
 : As a result, when adding an assertion in a function, using BUILD_BUG_ON() is
 : slightly preferred, because it's less likely to break in the future.  E.g. if
 : X2AVIC_MAX_PHYSICAL_ID were changed to something that is a compile-time constant,
 : but for whatever reason isn't a pure integer constant.

Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Posted by David Matlack 1 month, 3 weeks ago

On Fri, Aug 1, 2025 at 3:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Aug 01, 2025, David Matlack wrote:
> > On Mon, Jul 28, 2025 at 2:49 PM James Houghton <jthoughton@google.com> wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index a6a1fb42b2d1..e2bde6a5e346 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -7624,8 +7624,14 @@ static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > > >  static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > >                                       const enum kvm_mmu_type mmu_type)
> > > >  {
> > > > +#ifdef CONFIG_X86_64
> > > > +       const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> > > > +       spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
> > > > +#else
> > > > +       const bool is_tdp_mmu = false;
> > > > +       spinlock_t *tdp_mmu_pages_lock = NULL;
> > > > +#endif
> > > >         unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
> > > > -       bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
> > > >         struct list_head *nx_huge_pages;
> > > >         struct kvm_mmu_page *sp;
> > > >         LIST_HEAD(invalid_list);
> > > > @@ -7648,15 +7654,12 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > >         rcu_read_lock();
> > > >
> > > >         for ( ; to_zap; --to_zap) {
> > > > -#ifdef CONFIG_X86_64
> > > >                 if (is_tdp_mmu)
> > > > -                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > > > -#endif
> > > > +                       spin_lock(tdp_mmu_pages_lock);
> > > > +
> > > >                 if (list_empty(nx_huge_pages)) {
> > > > -#ifdef CONFIG_X86_64
> > > >                         if (is_tdp_mmu)
> > > > -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > > -#endif
> > > > +                               spin_unlock(tdp_mmu_pages_lock);
> > > >                         break;
> > > >                 }
> > > >
> > > > @@ -7675,10 +7678,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > > >
> > > >                 unaccount_nx_huge_page(kvm, sp);
> > > >
> > > > -#ifdef CONFIG_X86_64
> > > >                 if (is_tdp_mmu)
> > > > -                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > > -#endif
> > > > +                       spin_unlock(tdp_mmu_pages_lock);
> > > >
> > > >                 /*
> > > >                  * Do not attempt to recover any NX Huge Pages that are being
> > > > --
> > >
> > > LGTM! Thanks Sean.
> >
> > Is the compiler not smart enough to optimize out kvm->arch.tdp_mmu_pages_lock?
>
> Yes, the compiler will eliminate dead code at most optimization levels.  But that
> optimization phase happens after initial compilation, i.e. the compiler needs to
> generate the (probably intermediate?) code before it can trim away paths that are
> unreachable.
>
> > (To avoid needing the extra local variable?) I thought there was some other
> > KVM code that relied on similar optimizations but I would have to go dig them
> > up to remember.
>
> KVM, and the kernel, absolutely relies on dead code elimination.  KVM most blatantly
> uses the technique to avoid _defining_ stubs for code that is guarded by a Kconfig,
> e.g. all of these functions are defined in sev.c (guarded by CONFIG_KVM_AMD_SEV),
> but callers are guarded only with sev_guest() or sev_es_guest(), not with explicit
> #idefs.
>
> There are no build errors because the function calls aren't fully resolved until
> link time (when svm.o is linked into kvm-amd.o).  But KVM still needs to _declare_
> the functions, otherwise the compiler would fail during its initial code generation.
>
>   int pre_sev_run(struct vcpu_svm *svm, int cpu);
>   void sev_init_vmcb(struct vcpu_svm *svm);
>   void sev_vcpu_after_set_cpuid(struct vcpu_svm *svm);
>   int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in);
>   void sev_es_vcpu_reset(struct vcpu_svm *svm);
>   void sev_es_recalc_msr_intercepts(struct kvm_vcpu *vcpu);
>   void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
>   void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa);
>   void sev_es_unmap_ghcb(struct vcpu_svm *svm);
>
> Other notable "users" of dead code elimination are the BUILD_BUG_ON() family of
> compile-time asserts.  So long as the condition can be resolved to a constant
> false value during compile time, the "call" to __compiletime_error() will be
> elided and everyone is happy.
>
>   #ifdef __OPTIMIZE__
>   /*
>    * #ifdef __OPTIMIZE__ is only a good approximation; for instance "make
>    * CFLAGS_foo.o=-Og" defines __OPTIMIZE__, does not elide the conditional code
>    * and can break compilation with wrong error message(s). Combine with
>    * -U__OPTIMIZE__ when needed.
>    */
>   # define __compiletime_assert(condition, msg, prefix, suffix)         \
>         do {                                                            \
>                 /*                                                      \
>                  * __noreturn is needed to give the compiler enough     \
>                  * information to avoid certain possibly-uninitialized  \
>                  * warnings (regardless of the build failing).          \
>                  */                                                     \
>                 __noreturn extern void prefix ## suffix(void)           \
>                         __compiletime_error(msg);                       \
>                 if (!(condition))                                       \
>                         prefix ## suffix();                             \
>         } while (0)
>   #else
>   # define __compiletime_assert(condition, msg, prefix, suffix) ((void)(condition))
>   #endif
>
> Note, static_assert() is different in that it's a true assertion that's resolved
> early on during compilation.
>
>  * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
>  * scope, but requires the expression to be an integer constant
>  * expression (i.e., it is not enough that __builtin_constant_p() is
>  * true for expr).
>
>
> From a previous thread related to asserts (https://lore.kernel.org/all/aFGY0KVUksf1a6xB@google.com):
>
>  : The advantage of BUILD_BUG_ON() is that it works so long as the condition is
>  : compile-time constant, whereas static_assert() requires the condition to an
>  : integer constant expression.  E.g. BUILD_BUG_ON() can be used so long as the
>  : condition is eventually resolved to a constant, whereas static_assert() has
>  : stricter requirements.
>  :
>  : E.g. the fls64() assert below is fully resolved at compile time, but isn't a
>  : purely constant expression, i.e. that one *needs* to be BUILD_BUG_ON().
>  :
>  : --
>  : arch/x86/kvm/svm/avic.c: In function ‘avic_init_backing_page’:
>  : arch/x86/kvm/svm/avic.c:293:45: error: expression in static assertion is not constant
>  :   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
>  : include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
>  :    78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>  :       |                                                        ^~~~
>  : arch/x86/kvm/svm/avic.c:293:9: note: in expansion of macro ‘static_assert’
>  :   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
>  :       |         ^~~~~~~~~~~~~
>  : make[5]: *** [scripts/Makefile.build:203: arch/x86/kvm/svm/avic.o] Error 1
>  : --
>  :
>  : The downside of BUILD_BUG_ON() is that it can't be used at global scope, i.e.
>  : needs to be called from a function.
>  :
>  : As a result, when adding an assertion in a function, using BUILD_BUG_ON() is
>  : slightly preferred, because it's less likely to break in the future.  E.g. if
>  : X2AVIC_MAX_PHYSICAL_ID were changed to something that is a compile-time constant,
>  : but for whatever reason isn't a pure integer constant.

Thank you so much for the detailed explanation!

[PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately
[PATCH v5 2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose
[PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
[PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU
[PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance
[PATCH v5 6/7] KVM: selftests: Provide extra mmap flags in vm_mem_add()
[PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test