Use atomic64_dec_return() when decrementing the number of "pre-mapped"
S-EPT pages to ensure that the count can't go negative without KVM
noticing. In theory, checking for '0' and then decrementing in a separate
operation could miss a 0=>-1 transition. In practice, such a condition is
impossible because nr_premapped is protected by slots_lock, i.e. doesn't
actually need to be an atomic (that wart will be addressed shortly).
Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
ensures the VM is dead, i.e. there's no point in trying to limp along.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/tdx.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 88079e2d45fb..b7559ea1e353 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1774,10 +1774,9 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
tdx_no_vcpus_enter_stop(kvm);
}
if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
- if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+ if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
return -EIO;
- atomic64_dec(&kvm_tdx->nr_premapped);
return 0;
}
@@ -3162,8 +3161,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
goto out;
}
- if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
- atomic64_dec(&kvm_tdx->nr_premapped);
+ KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
--
2.51.0.268.g9569e192d0-goog
Sean Christopherson wrote: > Use atomic64_dec_return() when decrementing the number of "pre-mapped" > S-EPT pages to ensure that the count can't go negative without KVM > noticing. In theory, checking for '0' and then decrementing in a separate > operation could miss a 0=>-1 transition. In practice, such a condition is > impossible because nr_premapped is protected by slots_lock, i.e. doesn't > actually need to be an atomic (that wart will be addressed shortly). > > Don't bother trying to keep the count non-negative, as the KVM_BUG_ON() > ensures the VM is dead, i.e. there's no point in trying to limp along. > > Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> [snip]
On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote: > Use atomic64_dec_return() when decrementing the number of "pre-mapped" > S-EPT pages to ensure that the count can't go negative without KVM > noticing. In theory, checking for '0' and then decrementing in a separate > operation could miss a 0=>-1 transition. In practice, such a condition is > impossible because nr_premapped is protected by slots_lock, i.e. doesn't > actually need to be an atomic (that wart will be addressed shortly). > > Don't bother trying to keep the count non-negative, as the KVM_BUG_ON() > ensures the VM is dead, i.e. there's no point in trying to limp along. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> This area has gone through a lot of designs. In the v19 era PAGE.ADD got performed deep inside the fault by stuffing the source page in the vCPU. Then we switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populare the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we switched to current code which does something like KVM_PRE_FAULT_MEMORY internally, then looks up what got faulted and does the PAGE.ADD. Then the version in this series which does it even more directly. nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't like it, but it was needed because userspace could do unexpected things. Now it seems like its only purpose is to generate a KVM_BUG_ON() in tdx_sept_zap_private_spte(). I wonder if we could drop it all together and accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific error case. Yan, am I missing something?
On Thu, Aug 28, 2025 at 10:56:18AM +0800, Edgecombe, Rick P wrote: > On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote: > > Use atomic64_dec_return() when decrementing the number of "pre-mapped" > > S-EPT pages to ensure that the count can't go negative without KVM > > noticing. In theory, checking for '0' and then decrementing in a separate > > operation could miss a 0=>-1 transition. In practice, such a condition is > > impossible because nr_premapped is protected by slots_lock, i.e. doesn't > > actually need to be an atomic (that wart will be addressed shortly). > > > > Don't bother trying to keep the count non-negative, as the KVM_BUG_ON() > > ensures the VM is dead, i.e. there's no point in trying to limp along. > > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > --- > > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > > This area has gone through a lot of designs. In the v19 era PAGE.ADD got > performed deep inside the fault by stuffing the source page in the vCPU. Then we > switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populare > the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we > switched to current code which does something like KVM_PRE_FAULT_MEMORY > internally, then looks up what got faulted and does the PAGE.ADD. Then the > version in this series which does it even more directly. Right. If we invoke PAGE.ADD directly in tdx_sept_set_private_spte() (similar to PAGE.AUG), then we'll have to have some way to pass in the source page info. So, rather than passing around the source page, we opted to record the count of pages mapped in M-EPT while still unmapped in S-EPT, i.e., 1. map a page in M-EPT 2. increase nr_premapped. 3. map the page in S-EPT 4. decrease nr_premapped. If a page is zapped in M-EPT before 3, decrease nr_premapped. So if 3 is executed successfully after zapping of the M-EPT, decrease nr_premapped too. The unbalancing of nr_premapped due to the double decrease indicates the mismatching between M-EPT and S-EPT. If 3 never comes or fails, it's ok. > nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't > like it, but it was needed because userspace could do unexpected things. Now it > seems like its only purpose is to generate a KVM_BUG_ON() in > tdx_sept_zap_private_spte(). I wonder if we could drop it all together and > accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific > error case. > > Yan, am I missing something? Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected code change.
On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote: > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected > code change. When I checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in TDX code compared to the rest of KVM. (even after we dropped a bunch during development) We have to differentiate from good safety, and "safety" that is really just propping up brittle code. Each KVM_BUG_ON() is a hint that there might be design issues.
On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote: > > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected > > code change. > > When I checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in > TDX code compared to the rest of KVM. (even after we dropped a bunch during > development) We have to differentiate from good safety, and "safety" that is > really just propping up brittle code. Each KVM_BUG_ON() is a hint that there > might be design issues. Nah, I think we're good. The majority of the asserts are on SEAMCALLs, and those are no different than the WARN_ONCE() in vmx_insn_failed(), just spread out to individual call sites. Once those are out of the numbers are entirely reasonable (WARNs and KVM_BUG_ON are both assertions against bugs, one is just guaranteed to be fatal to the VM). $ git grep -e KVM_BUG_ON -e WARN_ vmx/tdx.c | wc -l 25 $ git grep -e KVM_BUG_ON -e WARN_ | wc -l 459
On Thu, 2025-08-28 at 15:33 -0700, Sean Christopherson wrote: > On Thu, Aug 28, 2025, Rick P Edgecombe wrote: > > On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote: > > > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected > > > code change. > > > > When I checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in > > TDX code compared to the rest of KVM. (even after we dropped a bunch during > > development) We have to differentiate from good safety, and "safety" that is > > really just propping up brittle code. Each KVM_BUG_ON() is a hint that there > > might be design issues. > > Nah, I think we're good. The majority of the asserts are on SEAMCALLs, and those > are no different than the WARN_ONCE() in vmx_insn_failed(), just spread out to > individual call sites. > > Once those are out of the numbers are entirely reasonable (WARNs and KVM_BUG_ON > are both assertions against bugs, one is just guaranteed to be fatal to the VM). > > $ git grep -e KVM_BUG_ON -e WARN_ vmx/tdx.c | wc -l > 25 > $ git grep -e KVM_BUG_ON -e WARN_ | wc -l > 459 Hmm, ok.
© 2016 - 2025 Red Hat, Inc.