[v1] KVM: x86/mmu: TDX post-populate cleanups

[RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Sean Christopherson 1 month, 1 week ago

Use atomic64_dec_return() when decrementing the number of "pre-mapped"
S-EPT pages to ensure that the count can't go negative without KVM
noticing.  In theory, checking for '0' and then decrementing in a separate
operation could miss a 0=>-1 transition.  In practice, such a condition is
impossible because nr_premapped is protected by slots_lock, i.e. doesn't
actually need to be an atomic (that wart will be addressed shortly).

Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
ensures the VM is dead, i.e. there's no point in trying to limp along.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 88079e2d45fb..b7559ea1e353 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1774,10 +1774,9 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
 			return -EIO;
 
-		atomic64_dec(&kvm_tdx->nr_premapped);
 		return 0;
 	}
 
@@ -3162,8 +3161,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		goto out;
 	}
 
-	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
-		atomic64_dec(&kvm_tdx->nr_premapped);
+	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
 
 	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
 		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-- 
2.51.0.268.g9569e192d0-goog

Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Ira Weiny 1 month ago

Sean Christopherson wrote:
> Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> S-EPT pages to ensure that the count can't go negative without KVM
> noticing.  In theory, checking for '0' and then decrementing in a separate
> operation could miss a 0=>-1 transition.  In practice, such a condition is
> impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> actually need to be an atomic (that wart will be addressed shortly).
> 
> Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> ensures the VM is dead, i.e. there's no point in trying to limp along.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Edgecombe, Rick P 1 month ago

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> S-EPT pages to ensure that the count can't go negative without KVM
> noticing.  In theory, checking for '0' and then decrementing in a separate
> operation could miss a 0=>-1 transition.  In practice, such a condition is
> impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> actually need to be an atomic (that wart will be addressed shortly).
> 
> Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> ensures the VM is dead, i.e. there's no point in trying to limp along.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

This area has gone through a lot of designs. In the v19 era PAGE.ADD got
performed deep inside the fault by stuffing the source page in the vCPU. Then we
switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populare
the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we
switched to current code which does something like KVM_PRE_FAULT_MEMORY
internally, then looks up what got faulted and does the PAGE.ADD. Then the
version in this series which does it even more directly.

nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't
like it, but it was needed because userspace could do unexpected things. Now it
seems like its only purpose is to generate a KVM_BUG_ON() in
tdx_sept_zap_private_spte(). I wonder if we could drop it all together and
accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific
error case.

Yan, am I missing something?

Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Yan Zhao 1 month ago

On Thu, Aug 28, 2025 at 10:56:18AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> > S-EPT pages to ensure that the count can't go negative without KVM
> > noticing.  In theory, checking for '0' and then decrementing in a separate
> > operation could miss a 0=>-1 transition.  In practice, such a condition is
> > impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> > actually need to be an atomic (that wart will be addressed shortly).
> > 
> > Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> > ensures the VM is dead, i.e. there's no point in trying to limp along.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> 
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> This area has gone through a lot of designs. In the v19 era PAGE.ADD got
> performed deep inside the fault by stuffing the source page in the vCPU. Then we
> switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populare
> the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we
> switched to current code which does something like KVM_PRE_FAULT_MEMORY
> internally, then looks up what got faulted and does the PAGE.ADD. Then the
> version in this series which does it even more directly.
Right.
If we invoke PAGE.ADD directly in tdx_sept_set_private_spte() (similar to
PAGE.AUG), then we'll have to have some way to pass in the source page info.

So, rather than passing around the source page, we opted to record the count
of pages mapped in M-EPT while still unmapped in S-EPT, i.e.,

1. map a page in M-EPT
2. increase nr_premapped.
3. map the page in S-EPT
4. decrease nr_premapped.

If a page is zapped in M-EPT before 3, decrease nr_premapped. So if 3 is
executed successfully after zapping of the M-EPT, decrease nr_premapped too.
The unbalancing of nr_premapped due to the double decrease indicates the
mismatching between M-EPT and S-EPT.
If 3 never comes or fails, it's ok.


> nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't
> like it, but it was needed because userspace could do unexpected things. Now it
> seems like its only purpose is to generate a KVM_BUG_ON() in
> tdx_sept_zap_private_spte(). I wonder if we could drop it all together and
> accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific
> error case.
> 
> Yan, am I missing something?
Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
code change.

Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Edgecombe, Rick P 1 month ago

On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote:
> Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
> code change.

When I checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in
TDX code compared to the rest of KVM. (even after we dropped a bunch during
development) We have to differentiate from good safety, and "safety" that is
really just propping up brittle code. Each KVM_BUG_ON() is a hint that there
might be design issues.

Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Sean Christopherson 1 month ago

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote:
> > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
> > code change.
> 
> When I checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in
> TDX code compared to the rest of KVM. (even after we dropped a bunch during
> development) We have to differentiate from good safety, and "safety" that is
> really just propping up brittle code. Each KVM_BUG_ON() is a hint that there
> might be design issues.

Nah, I think we're good.  The majority of the asserts are on SEAMCALLs, and those
are no different than the WARN_ONCE() in vmx_insn_failed(), just spread out to
individual call sites.

Once those are out of the numbers are entirely reasonable (WARNs and KVM_BUG_ON
are both assertions against bugs, one is just guaranteed to be fatal to the VM).

  $ git grep -e KVM_BUG_ON -e WARN_ vmx/tdx.c | wc -l
  25
  $ git grep -e KVM_BUG_ON -e WARN_  | wc -l
  459

Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent

Posted by Edgecombe, Rick P 1 month ago

On Thu, 2025-08-28 at 15:33 -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> > On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote:
> > > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
> > > code change.
> > 
> > When I checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in
> > TDX code compared to the rest of KVM. (even after we dropped a bunch during
> > development) We have to differentiate from good safety, and "safety" that is
> > really just propping up brittle code. Each KVM_BUG_ON() is a hint that there
> > might be design issues.
> 
> Nah, I think we're good.  The majority of the asserts are on SEAMCALLs, and those
> are no different than the WARN_ONCE() in vmx_insn_failed(), just spread out to
> individual call sites.
> 
> Once those are out of the numbers are entirely reasonable (WARNs and KVM_BUG_ON
> are both assertions against bugs, one is just guaranteed to be fatal to the VM).
> 
>   $ git grep -e KVM_BUG_ON -e WARN_ vmx/tdx.c | wc -l
>   25
>   $ git grep -e KVM_BUG_ON -e WARN_  | wc -l
>   459

Hmm, ok.