[PATCH v2 00/24] TDX MMU Part 2

Yan Zhao posted 24 patches 1 year, 1 month ago
Hi,

Here is v2 of the TDX “MMU part 2” series.
As discussed earlier, non-nit feedback from v1[0] has been applied.
- Among the changes, patch "KVM: TDX: MTRR: implement get_mt_mask() for TDX"
  was dropped. The self-snoop feature was not made a dependency for enabling
  TDX because the check for self-snoop was not included in
  kvm_mmu_may_ignore_guest_pat() in the base code. So, strictly speaking,
  the current code would incorrectly zap the mirrored root if non-coherent
  DMA devices were hot-plugged.

There were also a few minor issues that I noticed and fixed without internal
discussion (noted in each patch's version log).

It’s now ready to hand off to Paolo/kvm-coco-queue.


One remaining item that requires further discussion is "How to handle
the TDX module lock contention (i.e. SEAMCALL retry replacements)".
The basis for future discussions includes:
(1) TDH.MEM.TRACK can contend with TDH.VP.ENTER on the TD epoch lock.
(2) TDH.VP.ENTER contends with TDH.MEM* on S-EPT tree lock when 0-stepping
    mitigation is triggered.
    - The zero-step mitigation threshold is counted per vCPU: mitigation
      triggers when the TDX module finds that EPT violations are caused by
      the same RIP as in the last TDH.VP.ENTER for 6 consecutive times.
      The threshold value 6 is explained as
      "There can be at most 2 mapping faults on instruction fetch
       (x86 macro-instructions length is at most 15 bytes) when the
       instruction crosses page boundary; then there can be at most 2
       mapping faults for each memory operand, when the operand crosses
       page boundary. For most of x86 macro-instructions, there are up to 2
       memory operands and each one of them is small, which brings us to
       maximum 2+2*2 = 6 legal mapping faults."
    - EPT violations caused by TDG.MEM.PAGE.ACCEPT do not trigger 0-stepping
      mitigation. Since a TD is required to call TDG.MEM.PAGE.ACCEPT before
      accessing private memory when configured with pending_ve_disable=Y,
      0-stepping mitigation is not expected to occur in such a TD.
(3) TDG.MEM.PAGE.ACCEPT can contend with SEAMCALLs TDH.MEM*.
    (Actually, TDG.MEM.PAGE.ATTR.RD or TDG.MEM.PAGE.ATTR.WR can also
     contend with SEAMCALLs TDH.MEM*. Although we don't need to consider
     these two TDCALLs when enabling basic TDX, they are allowed by the
     TDX module, and we can't control whether a TD invokes a TDCALL or
     not).

The patch "KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
is still in place in this series (at the tail), but it should be dropped once
we finalize the real solution.


This series has 5 commits intended to collect Acks from x86 maintainers.
These commits introduce and export SEAMCALL wrappers to allow KVM to manage
the S-EPT (the EPT that maps private memory and is protected by the TDX
module):

  x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
    pages
  x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
  x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
  x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
  x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
    contents
 
This series is based off of a kvm-coco-queue commit and some pre-req
series:
1. commit ee69eb746754 ("KVM: x86/mmu: Prevent aliased memslot GFNs") (in
   kvm-coco-queue).
2. v7 of "TDX host: metadata reading tweaks, bug fix and info dump" [1].
3. v1 of "KVM: VMX: Initialize TDX when loading KVM module" [2], with some
   new feedback from Sean.
4. v2 of “TDX vCPU/VM creation” [3].
 
It requires TDX module 1.5.06.00.0744 [4] or later, due to the removal of the
workarounds for the lack of the NO_RBP_MOD feature required by the kernel.
NO_RBP_MOD is now enabled (in the VM/vCPU creation patches), and this
particular TDX module version contains a required NO_RBP_MOD-related bug fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").

 
The series has been tested as part of the development branch for the TDX
base series. Testing consisted of TDX kvm-unit-tests, booting a Linux TD,
and TDX-enhanced KVM selftests.

The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-11.3

Matching QEMU:
https://github.com/intel-staging/qemu-tdx/commits/tdx-qemu-upstream-v6.1/

[0] https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/kvm/cover.1731318868.git.kai.huang@intel.com/#t
[2] https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/
[3] https://lore.kernel.org/kvm/20241030190039.77971-1-rick.p.edgecombe@intel.com/
[4] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06


Isaku Yamahata (17):
  KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
  KVM: TDX: Add accessors VMX VMCS helpers
  KVM: TDX: Set gfn_direct_bits to shared bit
  x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
    pages
  x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
  x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
  x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
  x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
    contents
  KVM: TDX: Require TDP MMU and mmio caching for TDX
  KVM: x86/mmu: Add setter for shadow_mmio_value
  KVM: TDX: Set per-VM shadow_mmio_value to 0
  KVM: TDX: Handle TLB tracking for TDX
  KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page
    table
  KVM: TDX: Implement hook to get max mapping level of private pages
  KVM: TDX: Add an ioctl to create initial guest memory
  KVM: TDX: Finalize VM initialization
  KVM: TDX: Handle vCPU dissociation

Rick Edgecombe (3):
  KVM: x86/mmu: Implement memslot deletion for TDX
  KVM: VMX: Teach EPT violation helper about private mem
  KVM: x86/mmu: Export kvm_tdp_map_page()

Sean Christopherson (2):
  KVM: VMX: Split out guts of EPT violation to common/exposed function
  KVM: TDX: Add load_mmu_pgd method for TDX

Yan Zhao (1):
  KVM: x86/mmu: Do not enable page track for TD guest

Yuan Yao (1):
  [HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand
    SEPT

 arch/x86/include/asm/tdx.h      |   9 +
 arch/x86/include/asm/vmx.h      |   1 +
 arch/x86/include/uapi/asm/kvm.h |  10 +
 arch/x86/kvm/mmu.h              |   4 +
 arch/x86/kvm/mmu/mmu.c          |   7 +-
 arch/x86/kvm/mmu/page_track.c   |   3 +
 arch/x86/kvm/mmu/spte.c         |   8 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  37 +-
 arch/x86/kvm/vmx/common.h       |  43 ++
 arch/x86/kvm/vmx/main.c         | 104 ++++-
 arch/x86/kvm/vmx/tdx.c          | 727 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h          |  93 ++++
 arch/x86/kvm/vmx/tdx_arch.h     |  23 +
 arch/x86/kvm/vmx/vmx.c          |  25 +-
 arch/x86/kvm/vmx/x86_ops.h      |  51 +++
 arch/x86/virt/vmx/tdx/tdx.c     | 176 ++++++++
 arch/x86/virt/vmx/tdx/tdx.h     |   8 +
 virt/kvm/kvm_main.c             |   1 +
 18 files changed, 1278 insertions(+), 52 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/common.h

-- 
2.43.2

Re: [PATCH v2 00/24] TDX MMU Part 2
Posted by Paolo Bonzini 11 months, 3 weeks ago
On 11/12/24 08:33, Yan Zhao wrote:
> Hi,
> 
> Here is v2 of the TDX “MMU part 2” series.
> As discussed earlier, non-nit feedback from v1[0] has been applied.
> - Among them, patch "KVM: TDX: MTRR: implement get_mt_mask() for TDX" was
>    dropped. The feature self-snoop was not made a dependency for enabling
>    TDX since checking for the feature self-snoop was not included in
>    kvm_mmu_may_ignore_guest_pat() in the base code. So, strictly speaking,
>    current code would incorrectly zap the mirrored root if non-coherent DMA
>    devices were hot-plugged.
> 
> There were also a few minor issues noticed by me and fixed without internal
> discussion (noted in each patch's version log).
> 
> It’s now ready to hand off to Paolo/kvm-coco-queue.
> 
> 
> One remaining item that requires further discussion is "How to handle
> the TDX module lock contention (i.e. SEAMCALL retry replacements)".
> The basis for future discussions includes:
> (1) TDH.MEM.TRACK can contend with TDH.VP.ENTER on the TD epoch lock.
> (2) TDH.VP.ENTER contends with TDH.MEM* on S-EPT tree lock when 0-stepping
>      mitigation is triggered.
>      - The threshold of zero-step mitigation is counted per-vCPU when the
>        TDX module finds that EPT violations are caused by the same RIP as
>        in the last TDH.VP.ENTER for 6 consecutive times.
>        The threshold value 6 is explained as
>        "There can be at most 2 mapping faults on instruction fetch
>         (x86 macro-instructions length is at most 15 bytes) when the
>         instruction crosses page boundary; then there can be at most 2
>         mapping faults for each memory operand, when the operand crosses
>         page boundary. For most of x86 macro-instructions, there are up to 2
>         memory operands and each one of them is small, which brings us to
>         maximum 2+2*2 = 6 legal mapping faults."
>      - If the EPT violations received by KVM are caused by
>        TDG.MEM.PAGE.ACCEPT, they will not trigger 0-stepping mitigation.
>        Since a TD is required to call TDG.MEM.PAGE.ACCEPT before accessing a
>        private memory when configured with pending_ve_disable=Y, 0-stepping
>        mitigation is not expected to occur in such a TD.
> (3) TDG.MEM.PAGE.ACCEPT can contend with SEAMCALLs TDH.MEM*.
>      (Actually, TDG.MEM.PAGE.ATTR.RD or TDG.MEM.PAGE.ATTR.WR can also
>       contend with SEAMCALLs TDH.MEM*. Although we don't need to consider
>       these two TDCALLs when enabling basic TDX, they are allowed by the
>       TDX module, and we can't control whether a TD invokes a TDCALL or
>       not).
> 
> The "KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" is
> still in place in this series (at the tail), but we should drop it when we
> finalize on the real solution.
> 
> 
> This series has 5 commits intended to collect Acks from x86 maintainers.
> These commits introduce and export SEAMCALL wrappers to allow KVM to manage
> the S-EPT (the EPT that maps private memory and is protected by the TDX
> module):
> 
>    x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
>      pages
>    x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
>    x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
>    x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
>    x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
>      contents

Apart from the possible changes to the SEAMCALL wrappers, this is in 
good shape.

Thanks,

Paolo

[RFC PATCH 0/2] SEPT SEAMCALL retry proposal
Posted by Yan Zhao 1 year ago
This SEPT SEAMCALL retry proposal aims to remove patch
"[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
[1] at the tail of v2 series "TDX MMU Part 2".

==brief history==

In the v1 series 'TDX MMU Part 2', there were several discussions regarding
the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
SEAMCALL wrapper tdx_seamcall_sept().

The lock status of each SEAMCALL relevant to KVM was analyzed in [2].

The conclusion was that 16 retries was necessary because
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.

  When the TDX module detects that EPT violations are caused by the same
  RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
  will take SEPT tree lock and contend with tdh_mem*().

- tdg_mem_page_accept() can contend with tdh_mem*().


Sean provided several good suggestions[3], including:
- Implement retries within TDX code when the TDP MMU returns
  RET_PF_RETRY_FROZEN (for RET_PF_RETRY with a frozen SPTE) to avoid
  triggering 0-step mitigation.
- It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
  inside TDX module.
- Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
  prevent tdh_vp_enter() during page uninstallation.

Yan later found out that retrying only on RET_PF_RETRY_FROZEN within TDX code
is insufficient to prevent 0-step mitigation [4].

Rick and Yan then consulted TDX module team with findings that:
- The threshold of zero-step mitigation is counted per vCPU.
  It's of value 6 because

    "There can be at most 2 mapping faults on instruction fetch
     (x86 macro-instructions length is at most 15 bytes) when the
     instruction crosses page boundary; then there can be at most 2
     mapping faults for each memory operand, when the operand crosses
     page boundary. For most of x86 macro-instructions, there are up to 2
     memory operands and each one of them is small, which brings us to
     maximum 2+2*2 = 6 legal mapping faults."
  
- Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also
  contend with SEAMCALLs tdh_mem*().

So, we decided to make a proposal to tolerate 0-step mitigation.

==proposal details==

The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
together which are required for page installation and uninstallation.

These SEAMCALLs can be divided into three groups:
Group 1: tdh_mem_page_add().
         The SEAMCALL is invoked only during TD build time and therefore
         KVM has ensured that no contention will occur.

         Proposal: (as in patch 1)
         Just return error when TDX_OPERAND_BUSY is found.

Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
         These two SEAMCALLs are invoked for page installation.
         They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
         (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
         tdg_mem_page_attr_rd/wr().

         Proposal: (as in patch 1)
         - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
           to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
         
         - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
           long as there are no pending signals/interrupts.

         The retry inside TDX aims to reduce the count of tdh_vp_enter()
         calls before resolving EPT violations on the local vCPU, thereby
         minimizing contention with other vCPUs. However, it can't
         completely eliminate 0-step mitigation, as it exits when there are
         pending signals/interrupts and does not (and cannot) remove the
         tdh_vp_enter() calls caused by KVM_EXIT_MEMORY_FAULT.

         Resources    SHARED  users      EXCLUSIVE users
         ------------------------------------------------------------
         SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
                    tdh_mem_page_aug
         ------------------------------------------------------------
         SEPT entry                      tdh_mem_sept_add (Host lock)
                                         tdh_mem_page_aug (Host lock)
                                         tdg_mem_page_accept (Guest lock)
                                         tdg_mem_page_attr_rd (Guest lock)
                                         tdg_mem_page_attr_wr (Guest lock)

Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
         These three SEAMCALLs are invoked for page uninstallation, with
         KVM mmu_lock held for writing.

         Resources     SHARED users      EXCLUSIVE users
         ------------------------------------------------------------
         TDCS epoch    tdh_vp_enter      tdh_mem_track
         ------------------------------------------------------------
         SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
                                         tdh_mem_range_block   
         ------------------------------------------------------------
         SEPT entry                      tdh_mem_range_block (Host lock)
                                         tdh_mem_page_remove (Host lock)
                                         tdg_mem_page_accept (Guest lock)
                                         tdg_mem_page_attr_rd (Guest lock)
                                         tdg_mem_page_attr_wr (Guest lock)

         Proposal: (as in patch 2)
         - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
           once.
         - During the retry, kick off all vCPUs and prevent any vCPU from
           entering to avoid potential contentions.

         This is because tdh_vp_enter() and TDCALLs are not protected by
         KVM mmu_lock, and remove_external_spte() in KVM must not fail.



SEAMCALL                Lock Type        Resource 
-----------------------------Group 1--------------------------------
tdh_mem_page_add        EXCLUSIVE        TDR
                        NO_LOCK          TDCS
                        NO_LOCK          SEPT tree
                        EXCLUSIVE        page to add

----------------------------Group 2--------------------------------
tdh_mem_sept_add        SHARED           TDR
                        SHARED           TDCS
                        SHARED           SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify
                        EXCLUSIVE        page to add


tdh_mem_page_aug        SHARED           TDR
                        SHARED           TDCS
                        SHARED           SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify
                        EXCLUSIVE        page to aug

----------------------------Group 3--------------------------------
tdh_mem_range_block     SHARED           TDR
                        SHARED           TDCS
                        EXCLUSIVE        SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify

tdh_mem_track           SHARED           TDR
                        SHARED           TDCS
                        EXCLUSIVE        TDCS epoch

tdh_mem_page_remove     SHARED           TDR
                        SHARED           TDCS
                        SHARED           SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify
                        EXCLUSIVE        page to remove


[1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
[2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
[3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
[4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com

Yan Zhao (2):
  KVM: TDX: Retry in TDX when installing TD private/sept pages
  KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal

 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/mmu/mmu.c          |   2 +-
 arch/x86/kvm/vmx/tdx.c          | 102 +++++++++++++++++++++++++-------
 3 files changed, 85 insertions(+), 21 deletions(-)

-- 
2.43.2
Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
Posted by Edgecombe, Rick P 12 months ago
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of v2 series "TDX MMU Part 2".

We discussed this on the PUCK call. A couple of alternatives were considered:
 - Avoiding 0-step. To handle signals kicking to userspace we could try to add
code to generate synthetic EPT violations if KVM thinks the 0-step mitigation
might be active (i.e. the fault was not resolved). The consensus was that this
would be a continuing battle and possibly impossible, since normal guest
behavior can trigger the mitigation.
 - Pre-faulting all S-EPT, such that contention with AUG won't happen. The
discussion was that this would only be a temporary solution as the MMU
operations get more complicated (huge pages, etc.). There are also
private/shared conversions and memory hotplug already.

So we will proceed with this kick+lock+retry solution. The reasoning is to
optimize for the normal non-contention path, without having an overly
complicated solution for KVM.

In all the branch commotion recently, these patches fell out of our dev branch.
So we just recently integrated them into a 6.13 kvm-coco-queue based branch. We
need to perform some regression tests based on 6.13 TDP MMU changes. Assuming no
issues, we can post the 6.13 rebase to be included in kvm-coco-queue, with
instructions on which patches to remove from kvm-coco-queue (i.e. the 16
retries).


We also briefly touched on the TDX module behavior where guest operations can
lock NP PTEs. The kick solution doesn't require changing this functionally, but
it should still be done to help with debugging issues related to KVM's
contention solution.

Thanks all for the discussion!
Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
Posted by Sean Christopherson 12 months ago
On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> > This SEPT SEAMCALL retry proposal aims to remove patch
> > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> > [1] at the tail of v2 series "TDX MMU Part 2".
> 
> We discussed this on the PUCK call. A couple alternatives were considered:
>  - Avoiding 0-step. To handle signals kicking to userspace we could try to add
> code to generate synthetic EPT violations if KVM thinks the 0-step mitigation
> might be active (i.e. the fault was not resolved). The consensus was that this
> would be continuing battle and possibly impossible due normal guest behavior
> triggering the mitigation. 

Specifically, the TDX Module takes its write-lock if the guest takes EPT
violation exits on the same RIP 6 times, i.e. it detects forward progress based
purely on the RIP at entry vs. exit.  So a guest that is touching memory in a
loop could trigger zero-step checking even if KVM promptly fixes every EPT
violation.

>  - Pre-faulting all S-EPT, such that contention with AUG won't happen. The
> discussion was that this would only be a temporary solution as the MMU
> operations get more complicated (huge pages, etc). Also there is also
> private/shared conversions and memory hotplug already.
> 
> So we will proceed with this kick+lock+retry solution. The reasoning is to
> optimize for the normal non-contention path, without having an overly
> complicated solution for KVM.
> 
> In all the branch commotion recently, these patches fell out of our dev branch.
> So we just recently integrated then into a 6.13 kvm-coco-queue based branch. We
> need to perform some regression tests based on 6.13 TDP MMU changes. Assuming no
> issues, we can post the 6.13 rebase to included in kvm-coco-queue with
> instructions on which patches to remove from kvm-coco-queue (i.e. the 16
> retries).
> 
> 
> We also briefly touched on the TDX module behavior where guest operations can
> lock NP PTEs. The kick solution doesn't require changing this functionally, but
> it should still be done to help with debugging issues related to KVM's
> contention solution.

And so that KVM developers don't have to deal with customer escalations due to
performance issues caused by known flaws in the TDX module.

> 
> Thanks all for the discussion!
Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
Posted by Yan Zhao 1 year ago
On Thu, Nov 21, 2024 at 07:51:39PM +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of v2 series "TDX MMU Part 2".
> 
> ==brief history==
> 
> In the v1 series 'TDX MMU Part 2', there were several discussions regarding
> the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
> SEAMCALL wrapper tdx_seamcall_sept().
> 
> The lock status of each SEAMCALL relevant to KVM was analyzed in [2].
> 
> The conclusion was that 16 retries was necessary because
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
>   When the TDX module detects that EPT violations are caused by the same
>   RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
>   will take SEPT tree lock and contend with tdh_mem*().
> 
> - tdg_mem_page_accept() can contend with other tdh_mem*().
> 
> 
> Sean provided several good suggestions[3], including:
> - Implement retries within TDX code when the TDP MMU returns
>   RET_PF_RETRY_FROZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering
>   0-step mitigation.
> - It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
>   inside TDX module.
> - Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
>   prevent tdh_vp_enter() during page uninstallation.
> 
> Yan later found out that only retry RET_PF_RETRY_FROZEN within TDX code is
> insufficient to prevent 0-step mitigation [4].
> 
> Rick and Yan then consulted TDX module team with findings that:
> - The threshold of zero-step mitigation is counted per vCPU.
>   It's of value 6 because
> 
>     "There can be at most 2 mapping faults on instruction fetch
>      (x86 macro-instructions length is at most 15 bytes) when the
>      instruction crosses page boundary; then there can be at most 2
>      mapping faults for each memory operand, when the operand crosses
>      page boundary. For most of x86 macro-instructions, there are up to 2
>      memory operands and each one of them is small, which brings us to
>      maximum 2+2*2 = 6 legal mapping faults."
>   
> - Besides tdg_mem_page_accept(),  tdg_mem_page_attr_rd/wr() can also 
>   contend with SEAMCALLs tdh_mem*().
> 
> So, we decided to make a proposal to tolerate 0-step mitigation.
> 
> ==proposal details==
> 
> The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> together which are required for page installation and uninstallation.
> 
> These SEAMCALLs can be divided into three groups:
> Group 1: tdh_mem_page_add().
>          The SEAMCALL is invoked only during TD build time and therefore
>          KVM has ensured that no contention will occur.
> 
>          Proposal: (as in patch 1)
>          Just return error when TDX_OPERAND_BUSY is found.
> 
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>          These two SEAMCALLs are invoked for page installation. 
>          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> 	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> 	 tdg_mem_page_attr_rd/wr().
> 
>          Proposal: (as in patch 1)
>          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
>            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>          
>          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
>            long as there are no pending signals/interrupts.
Alternatively, we could have tdx_handle_ept_violation() not retry internally
in TDX, and instead keep the 16 retries in tdx_seamcall_sept() for
tdh_mem_sept_add() and tdh_mem_page_aug() only, i.e. only for the SEAMCALLs
in Group 2.

> 
>          The retry inside TDX aims to reduce the count of tdh_vp_enter()
>          before resolving EPT violations in the local vCPU, thereby
>          minimizing contentions with other vCPUs. However, it can't
>          completely eliminate 0-step mitigation as it exits when there're
>          pending signals/interrupts and does not (and cannot) remove the
>          tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> 
>          Resources    SHARED  users      EXCLUSIVE users
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
>                     tdh_mem_page_aug
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_sept_add (Host lock)
>                                          tdh_mem_page_aug (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
>          These three SEAMCALLs are invoked for page uninstallation, with
>          KVM mmu_lock held for writing.
> 
>          Resources     SHARED users      EXCLUSIVE users
>          ------------------------------------------------------------
>          TDCS epoch    tdh_vp_enter      tdh_mem_track
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
>                                          tdh_mem_range_block   
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_range_block (Host lock)
>                                          tdh_mem_page_remove (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
>          Proposal: (as in patch 2)
>          - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
>            once.
>          - During the retry, kick off all vCPUs and prevent any vCPU from
>            entering to avoid potential contentions.
> 
>          This is because tdh_vp_enter() and TDCALLs are not protected by
>          KVM mmu_lock, and remove_external_spte() in KVM must not fail.
> 
> 
> 
> SEAMCALL                Lock Type        Resource 
> -----------------------------Group 1--------------------------------
> tdh_mem_page_add        EXCLUSIVE        TDR
>                         NO_LOCK          TDCS
>                         NO_LOCK          SEPT tree
>                         EXCLUSIVE        page to add
> 
> ----------------------------Group 2--------------------------------
> tdh_mem_sept_add        SHARED           TDR
>                         SHARED           TDCS
>                         SHARED           SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
>                         EXCLUSIVE        page to add
> 
> 
> tdh_mem_page_aug        SHARED           TDR
>                         SHARED           TDCS
>                         SHARED           SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
>                         EXCLUSIVE        page to aug
> 
> ----------------------------Group 3--------------------------------
> tdh_mem_range_block     SHARED           TDR
>                         SHARED           TDCS
>                         EXCLUSIVE        SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
> 
> tdh_mem_track           SHARED           TDR
>                         SHARED           TDCS
>                         EXCLUSIVE        TDCS epoch
> 
> tdh_mem_page_remove     SHARED           TDR
>                         SHARED           TDCS
>                         SHARED           SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
>                         EXCLUSIVE        page to remove
> 
> 
> [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
> [2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
> [3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
> [4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com
> 
> Yan Zhao (2):
>   KVM: TDX: Retry in TDX when installing TD private/sept pages
>   KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
> 
>  arch/x86/include/asm/kvm_host.h |   2 +
>  arch/x86/kvm/mmu/mmu.c          |   2 +-
>  arch/x86/kvm/vmx/tdx.c          | 102 +++++++++++++++++++++++++-------
>  3 files changed, 85 insertions(+), 21 deletions(-)
> 
> -- 
> 2.43.2
>
Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
Posted by Edgecombe, Rick P 1 year ago
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> ==proposal details==
> 
> The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> together which are required for page installation and uninstallation.
> 
> These SEAMCALLs can be divided into three groups:
> Group 1: tdh_mem_page_add().
>          The SEAMCALL is invoked only during TD build time and therefore
>          KVM has ensured that no contention will occur.
> 
>          Proposal: (as in patch 1)
>          Just return error when TDX_OPERAND_BUSY is found.
> 
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>          These two SEAMCALLs are invoked for page installation. 
>          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> 	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> 	 tdg_mem_page_attr_rd/wr().

We did verify with the TDX module folks that the TDX module could be changed to
not take the SEPT host priority lock for zero entries (which happen during guest
operations). In that case, I think we shouldn't expect contention for
tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need the handling
for tdh_vp_enter() though.


> 
>          Proposal: (as in patch 1)
>          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
>            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>          
>          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
>            long as there are no pending signals/interrupts.
> 
>          The retry inside TDX aims to reduce the count of tdh_vp_enter()
>          before resolving EPT violations in the local vCPU, thereby
>          minimizing contentions with other vCPUs. However, it can't
>          completely eliminate 0-step mitigation as it exits when there're
>          pending signals/interrupts and does not (and cannot) remove the
>          tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> 
>          Resources    SHARED  users      EXCLUSIVE users
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
>                     tdh_mem_page_aug
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_sept_add (Host lock)
>                                          tdh_mem_page_aug (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
>          These three SEAMCALLs are invoked for page uninstallation, with
>          KVM mmu_lock held for writing.
> 
>          Resources     SHARED users      EXCLUSIVE users
>          ------------------------------------------------------------
>          TDCS epoch    tdh_vp_enter      tdh_mem_track
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
>                                          tdh_mem_range_block   
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_range_block (Host lock)
>                                          tdh_mem_page_remove (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
>          Proposal: (as in patch 2)
>          - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
>            once.
>          - During the retry, kick off all vCPUs and prevent any vCPU from
>            entering to avoid potential contentions.
> 
>          This is because tdh_vp_enter() and TDCALLs are not protected by
>          KVM mmu_lock, and remove_external_spte() in KVM must not fail.

The solution for group 3 actually doesn't look too bad at all to me, at least
code- and complexity-wise. It's pretty compact, doesn't add any locks, and is
limited to the tdx.c code. That said, I didn't evaluate the implementation
correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.

Were you able to test the fallback path at all? Can we think of any real
situations where it could be burdensome?

One other thing to note, I think, is that the group 3 SEAMCALLs are part of
no-fail operations: the core KVM calling code has no notion of a failure there.
So in this scheme of not avoiding contention, they have to succeed before
returning, whereas groups 1 and 2 can fail and so don't need the special
fallback scheme.
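To make the retry-once pattern concrete, here is a rough userspace sketch of the
group-3 scheme (this is not KVM code; fake_seamcall(), the no_vcpus_enter_*()
stand-ins, and the use of the BUSY constant here are all illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative value, matching the BUSY completion status class. */
#define TDX_OPERAND_BUSY 0x8000020000000000ULL

/* Toy state standing in for "some vCPU may hold a conflicting TDX lock". */
static bool vcpus_in_guest = true;

/* Models a SEAMCALL that fails with BUSY while vCPUs can still contend. */
static uint64_t fake_seamcall(void)
{
	return vcpus_in_guest ? TDX_OPERAND_BUSY : 0;
}

/* Stand-ins for tdx_no_vcpus_enter_start/stop(): kick all vCPUs out of the
 * guest and block re-entry, removing every possible source of contention. */
static void no_vcpus_enter_start(void) { vcpus_in_guest = false; }
static void no_vcpus_enter_stop(void)  { vcpus_in_guest = true; }

/* The group-3 pattern: try once; on BUSY, pause all vCPUs and retry exactly
 * once, at which point the SEAMCALL is expected to succeed. */
static uint64_t seamcall_retry_once(void)
{
	uint64_t err = fake_seamcall();

	if (err & TDX_OPERAND_BUSY) {
		no_vcpus_enter_start();
		err = fake_seamcall();
		no_vcpus_enter_stop();
	}
	return err;
}
```

The point of the model is just the shape of the no-fail path: exactly one
fallback retry, with all contention sources fenced off while it runs.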
Re: [RFC PATCH 0/2] SEPT SEAMCALL retry proposal
Posted by Yan Zhao 1 year ago
On Tue, Nov 26, 2024 at 08:46:51AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> > ==proposal details==
> > 
> > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> > together which are required for page installation and uninstallation.
> > 
> > These SEAMCALLs can be divided into three groups:
> > Group 1: tdh_mem_page_add().
> >          The SEAMCALL is invoked only during TD build time and therefore
> >          KVM has ensured that no contention will occur.
> > 
> >          Proposal: (as in patch 1)
> >          Just return error when TDX_OPERAND_BUSY is found.
> > 
> > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
> >          These two SEAMCALLs are invoked for page installation. 
> >          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> > 	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> > 	 tdg_mem_page_attr_rd/wr().
> 
> We did verify with TDX module folks that the TDX module could be changed to not
> take the sept host priority lock for zero entries (that happen during the guest
> operations). In that case, I think we shouldn't expect contention for
> tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for
> tdh_vp_enter() though.
Ah, you are right.

I previously thought, incorrectly, that the TDX module would avoid locking free
entries for tdg_mem_page_accept() only.


> 
> 
> > 
> >          Proposal: (as in patch 1)
> >          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
> >            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
> >          
> >          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
> >            long as there are no pending signals/interrupts.
> > 
> >          The retry inside TDX aims to reduce the count of tdh_vp_enter()
> >          before resolving EPT violations in the local vCPU, thereby
> >          minimizing contentions with other vCPUs. However, it can't
> >          completely eliminate 0-step mitigation as it exits when there're
> >          pending signals/interrupts and does not (and cannot) remove the
> >          tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> > 
> >          Resources    SHARED  users      EXCLUSIVE users
> >          ------------------------------------------------------------
> >          SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
> >                     tdh_mem_page_aug
> >          ------------------------------------------------------------
> >          SEPT entry                      tdh_mem_sept_add (Host lock)
> >                                          tdh_mem_page_aug (Host lock)
> >                                          tdg_mem_page_accept (Guest lock)
> >                                          tdg_mem_page_attr_rd (Guest lock)
> >                                          tdg_mem_page_attr_wr (Guest lock)
> > 
> > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
> >          These three SEAMCALLs are invoked for page uninstallation, with
> >          KVM mmu_lock held for writing.
> > 
> >          Resources     SHARED users      EXCLUSIVE users
> >          ------------------------------------------------------------
> >          TDCS epoch    tdh_vp_enter      tdh_mem_track
> >          ------------------------------------------------------------
> >          SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
> >                                          tdh_mem_range_block   
> >          ------------------------------------------------------------
> >          SEPT entry                      tdh_mem_range_block (Host lock)
> >                                          tdh_mem_page_remove (Host lock)
> >                                          tdg_mem_page_accept (Guest lock)
> >                                          tdg_mem_page_attr_rd (Guest lock)
> >                                          tdg_mem_page_attr_wr (Guest lock)
> > 
> >          Proposal: (as in patch 2)
> >          - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
> >            once.
> >          - During the retry, kick off all vCPUs and prevent any vCPU from
> >            entering to avoid potential contentions.
> > 
> >          This is because tdh_vp_enter() and TDCALLs are not protected by
> >          KVM mmu_lock, and remove_external_spte() in KVM must not fail.
> 
> The solution for group 3 actually doesn't look too bad at all to me. At least
> from code and complexity wise. It's pretty compact, doesn't add any locks, and
> limited to the tdx.c code. Although, I didn't evaluate the implementation 
> correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.
> 
> Were you able to test the fallback path at all? Can we think of any real
> situations where it could be burdensome?
Yes, I did some negative tests that force block/track/remove to fail, to check
that the tdx_no_vcpus_enter*() helpers work.

Even without those negative tests, it's not rare for tdh_mem_track() to return
busy due to its contention with tdh_vp_enter().

Given that tdh_mem_range_block() and tdh_mem_page_remove() normally don't return
busy often (if we reduce the frequency of zero-step mitigation), and that
tdh_mem_track() will kick off all vCPUs later anyway, I think it's acceptable to
do the tdx_no_vcpus_enter*() handling in the page removal path.


> One other thing to note I think, is that group 3 are part of no-fail operations.
> The core KVM calling code doesn't have the understanding of a failure there. So
> in this scheme of not avoiding contention we have to succeed before returning,
> where group 1 and 2 can fail, so don't need the special fallback scheme.
Yes, exactly.
[RFC PATCH 1/2] KVM: TDX: Retry in TDX when installing TD private/sept pages
Posted by Yan Zhao 1 year ago
For tdh_mem_page_add(), just return an error when TDX_OPERAND_BUSY is found.

For tdh_mem_sept_add(), tdh_mem_page_aug(),
- Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
  to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
- Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
  long as there are no pending signals/interrupts.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c |  2 +-
 arch/x86/kvm/vmx/tdx.c | 53 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd2eaa1dbebb..f16c0e7248eb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6179,7 +6179,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 		vcpu->stat.pf_spurious++;
 
 	if (r != RET_PF_EMULATE)
-		return 1;
+		return r;
 
 emulate:
 	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 23e6f25dd837..60d9e9d050ad 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1705,8 +1705,9 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 
 	err = tdh_mem_sept_add(to_kvm_tdx(kvm)->tdr_pa, gpa, tdx_level, hpa, &entry,
 			       &level_state);
-	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
-		return -EAGAIN;
+	if (unlikely(err & TDX_OPERAND_BUSY))
+		return -EBUSY;
+
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
 		return -EIO;
@@ -1855,6 +1856,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa = tdexit_gpa(vcpu);
 	unsigned long exit_qual;
+	bool local_retry = false;
+	int ret;
 
 	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
 		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
@@ -1873,6 +1876,23 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 * due to aliasing a single HPA to multiple GPAs.
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
+
+		/*
+		 * Mapping of private memory may meet RET_PF_RETRY due to
+		 * SEAMCALL contentions, e.g.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU vs
+		 *   TDH.VP.ENTER with 0-step mitigation on a remote vCPU.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU vs
+		 *   TDG.MEM.PAGE.ACCEPT on a remote vCPU.
+		 *
+		 * Retry internally in TDX to prevent exacerbating the
+		 * activation of 0-step mitigation on local vCPU.
+		 * However, despite these retries, the 0-step mitigation on the
+		 * local vCPU may still be triggered due to:
+		 * - Exiting on signals, interrupts.
+		 * - KVM_EXIT_MEMORY_FAULT.
+		 */
+		local_retry = true;
 	} else {
 		exit_qual = tdexit_exit_qual(vcpu);
 		/*
@@ -1885,7 +1905,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
-	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+
+	while (1) {
+		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+
+		if (ret != RET_PF_RETRY || !local_retry)
+			break;
+
+		/*
+		 * Break and keep the orig return value.
+		 * Signal & irq handling will be done later in vcpu_run()
+		 */
+		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
+		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
+			break;
+
+		cond_resched();
+	}
+	return ret;
 }
 
 int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
@@ -3028,13 +3065,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	}
 
 	ret = 0;
-	do {
-		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
-				       pfn_to_hpa(page_to_pfn(page)),
-				       &entry, &level_state);
-	} while (err == TDX_ERROR_SEPT_BUSY);
+	err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
+			       pfn_to_hpa(page_to_pfn(page)),
+			       &entry, &level_state);
 	if (err) {
-		ret = -EIO;
+		ret = unlikely(err & TDX_OPERAND_BUSY) ? -EBUSY : -EIO;
 		goto out;
 	}
 
-- 
2.43.2
[RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Yan Zhao 1 year ago
For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),

- Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
- During the retry, kick off all vCPUs and prevent any vCPU from entering
  to avoid potential contentions.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx/tdx.c          | 49 +++++++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 521c7cf725bc..bb7592110337 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -123,6 +123,8 @@
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
+#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
+	KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 60d9e9d050ad..ed6b41bbcec6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
 	__mb();
 }
 
+static void tdx_no_vcpus_enter_start(struct kvm *kvm)
+{
+	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
+}
+
+static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
+}
+
 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
 static int __tdx_reclaim_page(hpa_t pa)
 {
@@ -1648,15 +1662,20 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
 		return -EINVAL;
 
-	do {
-		/*
-		 * When zapping private page, write lock is held. So no race
-		 * condition with other vcpu sept operation.  Race only with
-		 * TDH.VP.ENTER.
-		 */
+	/*
+	 * When zapping private page, write lock is held. So no race
+	 * condition with other vcpu sept operation.  Race only with
+	 * TDH.VP.ENTER.
+	 */
+	err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
+				  &level_state);
+	if ((err & TDX_OPERAND_BUSY)) {
+		/* After no vCPUs enter, the second retry is expected to succeed */
+		tdx_no_vcpus_enter_start(kvm);
 		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
 					  &level_state);
-	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+		tdx_no_vcpus_enter_stop(kvm);
+	}
 
 	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
 		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
@@ -1728,8 +1747,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	WARN_ON_ONCE(level != PG_LEVEL_4K);
 
 	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
-	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
-		return -EAGAIN;
+	if (unlikely(err & TDX_OPERAND_BUSY)) {
+		/* After no vCPUs enter, the second retry is expected to succeed */
+		tdx_no_vcpus_enter_start(kvm);
+		err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
+		tdx_no_vcpus_enter_stop(kvm);
+	}
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
 		return -EIO;
@@ -1772,9 +1795,13 @@ static void tdx_track(struct kvm *kvm)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	do {
+	err = tdh_mem_track(kvm_tdx->tdr_pa);
+	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
+		/* After no vCPUs enter, the second retry is expected to succeed */
+		tdx_no_vcpus_enter_start(kvm);
 		err = tdh_mem_track(kvm_tdx->tdr_pa);
-	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+		tdx_no_vcpus_enter_stop(kvm);
+	}
 
 	if (KVM_BUG_ON(err, kvm))
 		pr_tdx_error(TDH_MEM_TRACK, err);
-- 
2.43.2
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Sean Christopherson 12 months ago
On Thu, Nov 21, 2024, Yan Zhao wrote:
> For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> 
> - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> - During the retry, kick off all vCPUs and prevent any vCPU from entering
>   to avoid potential contentions.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/vmx/tdx.c          | 49 +++++++++++++++++++++++++--------
>  2 files changed, 40 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 521c7cf725bc..bb7592110337 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -123,6 +123,8 @@
>  #define KVM_REQ_HV_TLB_FLUSH \
>  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> +	KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 60d9e9d050ad..ed6b41bbcec6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
>  	__mb();
>  }
>  
> +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> +{
> +	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);

I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
And with appropriate helpers in common code.  I could have sworn I floated this
idea in the past for something else, but apparently not.  The only thing I can
find is an old arm64 version for pausing vCPUs for emulation.  Hmm, maybe I was
thinking of KVM_REQ_OUTSIDE_GUEST_MODE?

Anyways, I don't see any reason to make this an arch specific request.
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Yan Zhao 12 months ago
On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> On Thu, Nov 21, 2024, Yan Zhao wrote:
> > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > 
> > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> >   to avoid potential contentions.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/vmx/tdx.c          | 49 +++++++++++++++++++++++++--------
> >  2 files changed, 40 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 521c7cf725bc..bb7592110337 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -123,6 +123,8 @@
> >  #define KVM_REQ_HV_TLB_FLUSH \
> >  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > +	KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  
> >  #define CR0_RESERVED_BITS                                               \
> >  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 60d9e9d050ad..ed6b41bbcec6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> >  	__mb();
> >  }
> >  
> > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > +{
> > +	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> 
> I vote for making this a common request with a more succient name, e.g. KVM_REQ_PAUSE.
KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
the guest's pause state?

> And with appropriate helpers in common code.  I could have sworn I floated this
> idea in the past for something else, but apparently not.  The only thing I can
Yes, you suggested I implement it via a request, similar to
KVM_REQ_MCLOCK_INPROGRESS [1].
(I didn't add your Suggested-by tag in this patch because it's just an RFC.)

[1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/

> find is an old arm64 version for pausing vCPUs to emulated.  Hmm, maybe I was
> thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode; it does not set
a bit in vcpu->requests to prevent vCPUs from entering later.

> Anyways, I don't see any reason to make this an arch specific request.
After making it non-arch-specific, we'd probably need an atomic counter for the
start/stop requests in the common helpers. So I just made it TDX-specific to
keep the RFC simple.

Will do it in a non-arch-specific way if you think it's worthwhile.
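For the multi-user case, a minimal userspace sketch of what refcounted common
helpers might look like (hypothetical names; the real implementation would also
have to send the request and kick vCPUs, which is elided here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-VM pause state: a counter instead of a single request
 * bit, so several independent users can nest start/stop safely. */
static atomic_int pause_count;

static void kvm_pause_vcpus_start(void)
{
	/* A real helper would also make the request on, and kick, all
	 * vCPUs here. */
	atomic_fetch_add(&pause_count, 1);
}

static void kvm_pause_vcpus_stop(void)
{
	/* A real helper would clear the request on every vCPU once the
	 * last user drops its reference. */
	atomic_fetch_sub(&pause_count, 1);
}

/* What the entry path would check before entering the guest. */
static bool vcpu_may_enter(void)
{
	return atomic_load(&pause_count) == 0;
}
```

The counter is what makes the request shareable: a vCPU may re-enter only after
every user that requested the pause has called stop.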
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Sean Christopherson 12 months ago
On Wed, Dec 18, 2024, Yan Zhao wrote:
> On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> > On Thu, Nov 21, 2024, Yan Zhao wrote:
> > > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > > 
> > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > >   to avoid potential contentions.
> > > 
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  2 ++
> > >  arch/x86/kvm/vmx/tdx.c          | 49 +++++++++++++++++++++++++--------
> > >  2 files changed, 40 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 521c7cf725bc..bb7592110337 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -123,6 +123,8 @@
> > >  #define KVM_REQ_HV_TLB_FLUSH \
> > >  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > >  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> > > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > > +	KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > >  
> > >  #define CR0_RESERVED_BITS                                               \
> > >  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 60d9e9d050ad..ed6b41bbcec6 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> > >  	__mb();
> > >  }
> > >  
> > > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > > +{
> > > +	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > 
> > I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
> KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
> the guest's pause state?

Maybe?

> > And with appropriate helpers in common code.  I could have sworn I floated this
> > idea in the past for something else, but apparently not.  The only thing I can
> Yes, you suggested me to implement it via a request, similar to
> KVM_REQ_MCLOCK_INPROGRESS. [1].
> (I didn't add your suggested-by tag in this patch because it's just an RFC).
> 
> [1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
> 
> > find is an old arm64 version for pausing vCPUs to emulated.  Hmm, maybe I was
> > thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
> KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
> a bit in vcpu->requests to prevent later vCPUs entering.

Yeah, I was mostly just talking to myself. :-)

> > Anyways, I don't see any reason to make this an arch specific request.
> After making it non-arch specific, probably we need an atomic counter for the
> start/stop requests in the common helpers. So I just made it TDX-specific to
> keep it simple in the RFC.

Oh, right.  I didn't consider the complications with multiple users.  Hrm.

Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
have the zap side set wait_for_sept_zap before blasting the request to all vCPUs,
and then in tdx_vcpu_run():

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b49dcf32206b..508ad6462e6d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -921,6 +921,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
                return EXIT_FASTPATH_NONE;
        }
 
+       if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
+               return EXIT_FASTPATH_EXIT_HANDLED;
+
        trace_kvm_entry(vcpu, force_immediate_exit);
 
        if (pi_test_on(&tdx->pi_desc)) {


Ooh, but there's a subtle flaw with that approach.  Unlike kvm_vcpu_exit_request(),
the above check would obviously happen before entry to the guest, which means that,
in theory, KVM needs to goto cancel_injection to re-request req_immediate_exit and
cancel injection:

	if (req_immediate_exit)
		kvm_make_request(KVM_REQ_EVENT, vcpu);
	kvm_x86_call(cancel_injection)(vcpu);

But!  This is actually an opportunity to harden KVM.  Because the TDX module
doesn't guarantee entry, it's already possible for KVM to _think_ it completed
entry to the guest without actually having done so.  It just happens to work
because KVM never needs to force an immediate exit for TDX, and can't do direct
injection, i.e. can "safely" skip the cancel_injection path.

So, I think we can and should go with the above suggestion, but also add a WARN
on req_immediate_exit being set, because TDX ignores it entirely.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b49dcf32206b..e23cd8231144 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -914,6 +914,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
        struct vcpu_tdx *tdx = to_tdx(vcpu);
        struct vcpu_vt *vt = to_tdx(vcpu);
 
+       /* <comment goes here> */
+       WARN_ON_ONCE(force_immediate_exit);
+
        /* TDX exit handle takes care of this error case. */
        if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
                tdx->vp_enter_ret = TDX_SW_ERROR;
@@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
                return EXIT_FASTPATH_NONE;
        }
 
+       if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
+               return EXIT_FASTPATH_EXIT_HANDLED;
+
        trace_kvm_entry(vcpu, force_immediate_exit);
 
        if (pi_test_on(&tdx->pi_desc)) {
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Yan Zhao 12 months ago
On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> On Wed, Dec 18, 2024, Yan Zhao wrote:
> > On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> > > On Thu, Nov 21, 2024, Yan Zhao wrote:
> > > > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > > > 
> > > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > > > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > > >   to avoid potential contentions.
> > > > 
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |  2 ++
> > > >  arch/x86/kvm/vmx/tdx.c          | 49 +++++++++++++++++++++++++--------
> > > >  2 files changed, 40 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 521c7cf725bc..bb7592110337 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -123,6 +123,8 @@
> > > >  #define KVM_REQ_HV_TLB_FLUSH \
> > > >  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > >  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> > > > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > > > +	KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > >  
> > > >  #define CR0_RESERVED_BITS                                               \
> > > >  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index 60d9e9d050ad..ed6b41bbcec6 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> > > >  	__mb();
> > > >  }
> > > >  
> > > > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > > > +{
> > > > +	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > > 
> > > I vote for making this a common request with a more succient name, e.g. KVM_REQ_PAUSE.
> > KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
> > the guest's pause state?
> 
> Maybe?
> 
> > > And with appropriate helpers in common code.  I could have sworn I floated this
> > > idea in the past for something else, but apparently not.  The only thing I can
> > Yes, you suggested me to implement it via a request, similar to
> > KVM_REQ_MCLOCK_INPROGRESS. [1].
> > (I didn't add your suggested-by tag in this patch because it's just an RFC).
> > 
> > [1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
> > 
> > > find is an old arm64 version for pausing vCPUs to emulated.  Hmm, maybe I was
> > > thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
> > KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
> > a bit in vcpu->requests to prevent later vCPUs entering.
> 
> Yeah, I was mostly just talking to myself. :-)
> 
> > > Anyways, I don't see any reason to make this an arch specific request.
> > After making it non-arch specific, probably we need an atomic counter for the
> > start/stop requests in the common helpers. So I just made it TDX-specific to
> > keep it simple in the RFC.
> 
> Oh, right.  I didn't consider the complications with multiple users.  Hrm.
> 
> Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
> forces vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
> have the zap side set wait_for_sept_zap before blasting the request to all vCPU,
Hmm, the wait_for_sept_zap flag would also need to be set and cleared on all
vCPUs except the current one.

> and then in tdx_vcpu_run():
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b49dcf32206b..508ad6462e6d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -921,6 +921,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>                 return EXIT_FASTPATH_NONE;
>         }
>  
> +       if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
> +               return EXIT_FASTPATH_EXIT_HANDLED;
> +
>         trace_kvm_entry(vcpu, force_immediate_exit);
>  
>         if (pi_test_on(&tdx->pi_desc)) {
> 
> 
> Ooh, but there's a subtle flaw with that approach.  Unlike kvm_vcpu_exit_request(),
> the above check would obviously happen before entry to the guest, which means that,
> in theory, KVM needs to goto cancel_injection to re-request req_immediate_exit and
> cancel injection:
> 
> 	if (req_immediate_exit)
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 	kvm_x86_call(cancel_injection)(vcpu);
> 
> But!  This actually an opportunity to harden KVM.  Because the TDX module doesn't
> guarantee entry, it's already possible for KVM to _think_ it completely entry to
> the guest without actually having done so.  It just happens to work because KVM
> never needs to force an immediate exit for TDX, and can't do direct injection,
> i.e. can "safely" skip the cancel_injection path.
> 
> So, I think can and should go with the above suggestion, but also add a WARN on
> req_immediate_exit being set, because TDX ignores it entirely.
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b49dcf32206b..e23cd8231144 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -914,6 +914,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>         struct vcpu_tdx *tdx = to_tdx(vcpu);
>         struct vcpu_vt *vt = to_tdx(vcpu);
>  
> +       /* <comment goes here> */
> +       WARN_ON_ONCE(force_immediate_exit);
Better to put this hardening in a separate fix to
commit 37d3baf545cd ("KVM: TDX: Implement TDX vcpu enter/exit path")?
It's required no matter which approach is chosen for the SEPT SEAMCALL retry.

>         /* TDX exit handle takes care of this error case. */
>         if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
>                 tdx->vp_enter_ret = TDX_SW_ERROR;
> @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>                 return EXIT_FASTPATH_NONE;
>         }
>  
> +       if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> +               return EXIT_FASTPATH_EXIT_HANDLED;
> +
>         trace_kvm_entry(vcpu, force_immediate_exit);
>  
>         if (pi_test_on(&tdx->pi_desc)) {
Thanks for this suggestion.
But what's the advantage of the wait_for_sept_zap-checking approach?
Is it to avoid introducing an arch-specific request?
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Sean Christopherson 12 months ago
On Thu, Dec 19, 2024, Yan Zhao wrote:
> On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> > On Wed, Dec 18, 2024, Yan Zhao wrote:
> > > > Anyways, I don't see any reason to make this an arch specific request.
> > > After making it non-arch specific, probably we need an atomic counter for the
> > > start/stop requests in the common helpers. So I just made it TDX-specific to
> > > keep it simple in the RFC.
> > 
> > Oh, right.  I didn't consider the complications with multiple users.  Hrm.
> > 
> > Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
> > forces vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> > EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
> > have the zap side set wait_for_sept_zap before blasting the request to all vCPU,
> Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
> the current one.

Why can't it be a VM-wide flag?  The current vCPU isn't going to do VP.ENTER, is
it?  If it is, I've definitely missed something :-)

> >         /* TDX exit handle takes care of this error case. */
> >         if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> >                 tdx->vp_enter_ret = TDX_SW_ERROR;
> > @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> >                 return EXIT_FASTPATH_NONE;
> >         }
> >  
> > +       if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> > +               return EXIT_FASTPATH_EXIT_HANDLED;
> > +
> >         trace_kvm_entry(vcpu, force_immediate_exit);
> >  
> >         if (pi_test_on(&tdx->pi_desc)) {
> Thanks for this suggestion.
> But what's the advantage of this checking wait_for_sept_zap approach?
> Is it to avoid introducing an arch specific request?

Yes, and unless I've missed something, "releasing" vCPUs can be done by clearing
a single variable.
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Yan Zhao 12 months ago
On Wed, Dec 18, 2024 at 06:39:29PM -0800, Sean Christopherson wrote:
> On Thu, Dec 19, 2024, Yan Zhao wrote:
> > On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> > > On Wed, Dec 18, 2024, Yan Zhao wrote:
> > > > > Anyways, I don't see any reason to make this an arch specific request.
> > > > After making it non-arch specific, probably we need an atomic counter for the
> > > > start/stop requests in the common helpers. So I just made it TDX-specific to
> > > > keep it simple in the RFC.
> > > 
> > > Oh, right.  I didn't consider the complications with multiple users.  Hrm.
> > > 
> > > Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
> > > forces vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> > > EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
> > > have the zap side set wait_for_sept_zap before blasting the request to all vCPU,
> > Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
> > the current one.
> 
> Why can't it be a VM-wide flag?  The current vCPU isn't going to do VP.ENTER, is
> it?  If it is, I've definitely missed something :-)
Ah, right. You are not missing anything. I just forgot it can be a VM-wide flag....
Sorry.

> 
> > >         /* TDX exit handle takes care of this error case. */
> > >         if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> > >                 tdx->vp_enter_ret = TDX_SW_ERROR;
> > > @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > >                 return EXIT_FASTPATH_NONE;
> > >         }
> > >  
> > > +       if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> > > +               return EXIT_FASTPATH_EXIT_HANDLED;
> > > +
> > >         trace_kvm_entry(vcpu, force_immediate_exit);
> > >  
> > >         if (pi_test_on(&tdx->pi_desc)) {
> > Thanks for this suggestion.
> > But what's the advantage of this checking wait_for_sept_zap approach?
> > Is it to avoid introducing an arch specific request?
> 
> Yes, and unless I've missed something, "releasing" vCPUs can be done by clearing
> a single variable.
Right. Will do in this way in the next version.
Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Edgecombe, Rick P 1 year ago
On Thu, 2024-11-21 at 19:57 +0800, Yan Zhao wrote:
>  
> +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> +{

I wonder if an mmu write lock assert here would be excessive.

> +	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> +}
> +
> +static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	unsigned long i;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
> +}
> +

Re: [RFC PATCH 2/2] KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Posted by Yan Zhao 1 year ago
On Tue, Nov 26, 2024 at 08:47:42AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-11-21 at 19:57 +0800, Yan Zhao wrote:
> >  
> > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > +{
> 
> I wonder if an mmu write lock assert here would be excessive.
Hmm, I didn't add one because all current callers already assert on it.

But asserting here would be safer, given that the function itself doesn't take
a lock or use a counter.


> > +	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > +}
> > +
> > +static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
> > +{
> > +	struct kvm_vcpu *vcpu;
> > +	unsigned long i;
> > +
> > +	kvm_for_each_vcpu(i, vcpu, kvm)
> > +		kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
> > +}
> > +
> 
Re: [PATCH v2 00/24] TDX MMU Part 2
Posted by Paolo Bonzini 1 year ago
Applied to kvm-coco-queue, thanks.  Tomorrow I will go through the
changes and review.

Paolo