Hi,
Here is v2 of the TDX "MMU part 2" series.
As discussed earlier, non-nit feedback from v1 [0] has been applied.
- Among them, patch "KVM: TDX: MTRR: implement get_mt_mask() for TDX" was
dropped. The self-snoop feature was not made a dependency for enabling
TDX because the check for self-snoop was not included in
kvm_mmu_may_ignore_guest_pat() in the base code. So, strictly speaking,
the current code would incorrectly zap the mirrored root if non-coherent
DMA devices were hot-plugged.
There were also a few minor issues that I noticed and fixed without internal
discussion (noted in each patch's version log).
It's now ready to hand off to Paolo/kvm-coco-queue.
One remaining item that requires further discussion is "How to handle
the TDX module lock contention (i.e. SEAMCALL retry replacements)".
The basis for future discussions includes:
(1) TDH.MEM.TRACK can contend with TDH.VP.ENTER on the TD epoch lock.
(2) TDH.VP.ENTER contends with TDH.MEM* on S-EPT tree lock when 0-stepping
mitigation is triggered.
- Zero-step mitigation is tracked per vCPU: the TDX module triggers the
mitigation when it finds that EPT violations are caused by the same RIP
as in the last TDH.VP.ENTER for 6 consecutive times.
The threshold value 6 is explained as
"There can be at most 2 mapping faults on instruction fetch
(x86 macro-instructions length is at most 15 bytes) when the
instruction crosses page boundary; then there can be at most 2
mapping faults for each memory operand, when the operand crosses
page boundary. For most of x86 macro-instructions, there are up to 2
memory operands and each one of them is small, which brings us to
maximum 2+2*2 = 6 legal mapping faults."
- If the EPT violations received by KVM are caused by
TDG.MEM.PAGE.ACCEPT, they will not trigger 0-stepping mitigation.
Since a TD is required to call TDG.MEM.PAGE.ACCEPT before accessing
private memory when configured with pending_ve_disable=Y, 0-stepping
mitigation is not expected to occur in such a TD.
(3) TDG.MEM.PAGE.ACCEPT can contend with SEAMCALLs TDH.MEM*.
(Actually, TDG.MEM.PAGE.ATTR.RD or TDG.MEM.PAGE.ATTR.WR can also
contend with SEAMCALLs TDH.MEM*. Although we don't need to consider
these two TDCALLs when enabling basic TDX, they are allowed by the
TDX module, and we can't control whether a TD invokes a TDCALL or
not).
The "KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" is
still in place in this series (at the tail), but we should drop it when we
finalize on the real solution.
This series has 5 commits intended to collect Acks from x86 maintainers.
These commits introduce and export SEAMCALL wrappers to allow KVM to manage
the S-EPT (the EPT that maps private memory and is protected by the TDX
module):
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
pages
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
contents
This series is based off of a kvm-coco-queue commit and some pre-req
series:
1. commit ee69eb746754 ("KVM: x86/mmu: Prevent aliased memslot GFNs") (in
kvm-coco-queue).
2. v7 of "TDX host: metadata reading tweaks, bug fix and info dump" [1].
3. v1 of "KVM: VMX: Initialize TDX when loading KVM module" [2], with some
new feedback from Sean.
4. v2 of "TDX vCPU/VM creation" [3].
This series requires TDX module 1.5.06.00.0744 [4] or later, due to the
removal of the workarounds for the lack of the NO_RBP_MOD feature required by
the kernel. NO_RBP_MOD is now enabled (in the VM/vCPU creation patches), and
this particular version of the TDX module has a required NO_RBP_MOD-related
bug fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").
The series has been tested as part of the development branch for the TDX
base series. The testing consisted of TDX kvm-unit-tests, TDX-enhanced KVM
selftests, and booting a Linux TD.
The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-11.3
Matching QEMU:
https://github.com/intel-staging/qemu-tdx/commits/tdx-qemu-upstream-v6.1/
[0] https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/kvm/cover.1731318868.git.kai.huang@intel.com/#t
[2] https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/
[3] https://lore.kernel.org/kvm/20241030190039.77971-1-rick.p.edgecombe@intel.com/
[4] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06
Isaku Yamahata (17):
KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
KVM: TDX: Add accessors VMX VMCS helpers
KVM: TDX: Set gfn_direct_bits to shared bit
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
pages
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
contents
KVM: TDX: Require TDP MMU and mmio caching for TDX
KVM: x86/mmu: Add setter for shadow_mmio_value
KVM: TDX: Set per-VM shadow_mmio_value to 0
KVM: TDX: Handle TLB tracking for TDX
KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page
table
KVM: TDX: Implement hook to get max mapping level of private pages
KVM: TDX: Add an ioctl to create initial guest memory
KVM: TDX: Finalize VM initialization
KVM: TDX: Handle vCPU dissociation
Rick Edgecombe (3):
KVM: x86/mmu: Implement memslot deletion for TDX
KVM: VMX: Teach EPT violation helper about private mem
KVM: x86/mmu: Export kvm_tdp_map_page()
Sean Christopherson (2):
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: TDX: Add load_mmu_pgd method for TDX
Yan Zhao (1):
KVM: x86/mmu: Do not enable page track for TD guest
Yuan Yao (1):
[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand
SEPT
arch/x86/include/asm/tdx.h | 9 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 10 +
arch/x86/kvm/mmu.h | 4 +
arch/x86/kvm/mmu/mmu.c | 7 +-
arch/x86/kvm/mmu/page_track.c | 3 +
arch/x86/kvm/mmu/spte.c | 8 +-
arch/x86/kvm/mmu/tdp_mmu.c | 37 +-
arch/x86/kvm/vmx/common.h | 43 ++
arch/x86/kvm/vmx/main.c | 104 ++++-
arch/x86/kvm/vmx/tdx.c | 727 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 93 ++++
arch/x86/kvm/vmx/tdx_arch.h | 23 +
arch/x86/kvm/vmx/vmx.c | 25 +-
arch/x86/kvm/vmx/x86_ops.h | 51 +++
arch/x86/virt/vmx/tdx/tdx.c | 176 ++++++++
arch/x86/virt/vmx/tdx/tdx.h | 8 +
virt/kvm/kvm_main.c | 1 +
18 files changed, 1278 insertions(+), 52 deletions(-)
create mode 100644 arch/x86/kvm/vmx/common.h
--
2.43.2
On 11/12/24 08:33, Yan Zhao wrote:
> This series has 5 commits intended to collect Acks from x86 maintainers.
> These commits introduce and export SEAMCALL wrappers to allow KVM to manage
> the S-EPT (the EPT that maps private memory and is protected by the TDX
> module):
>
>   x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT
>     pages
>   x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
>   x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
>   x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
>   x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial
>     contents

Apart from the possible changes to the SEAMCALL wrappers, this is in good
shape.

Thanks,

Paolo
This SEPT SEAMCALL retry proposal aims to remove patch
"[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
[1] at the tail of the v2 "TDX MMU Part 2" series.
==brief history==
In the v1 series 'TDX MMU Part 2', there were several discussions regarding
the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
SEAMCALL wrapper tdx_seamcall_sept().
The lock status of each SEAMCALL relevant to KVM was analyzed in [2].
The conclusion was that 16 retries were necessary because
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
When the TDX module detects that EPT violations are caused by the same
RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
will take SEPT tree lock and contend with tdh_mem*().
- tdg_mem_page_accept() can contend with other tdh_mem*().
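As a reference point for the discussion, here is a minimal sketch of the
kind of wrapper involved. The name tdx_seamcall_sept() and
TDX_ERROR_SEPT_BUSY come from the series; this exact body is an
illustrative assumption, not the series' verbatim code:

	/* Illustrative sketch of the 16-retry wrapper discussed above. */
	static u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
	{
		int retries = 16;
		u64 err;

		do {
			err = seamcall_ret(op, in);
		} while (err == TDX_ERROR_SEPT_BUSY && --retries > 0);

		return err;
	}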
Sean provided several good suggestions [3], including:
- Implement retries within TDX code when the TDP MMU returns
RET_PF_RETRY_FROZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering
0-step mitigation.
- It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
inside the TDX module.
- Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
prevent tdh_vp_enter() during page uninstallation.
Yan later found out that retrying only on RET_PF_RETRY_FROZEN within TDX code
is insufficient to prevent 0-step mitigation [4].
Rick and Yan then consulted the TDX module team, who confirmed that:
- The zero-step mitigation threshold is counted per vCPU.
It's 6 because
"There can be at most 2 mapping faults on instruction fetch
(x86 macro-instructions length is at most 15 bytes) when the
instruction crosses page boundary; then there can be at most 2
mapping faults for each memory operand, when the operand crosses
page boundary. For most of x86 macro-instructions, there are up to 2
memory operands and each one of them is small, which brings us to
maximum 2+2*2 = 6 legal mapping faults."
- Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also
contend with SEAMCALLs tdh_mem*().
So, we decided to make a proposal to tolerate 0-step mitigation.
==proposal details==
The proposal covers the SEPT-related and TLB-flush-related SEAMCALLs that are
required for page installation and uninstallation.
These SEAMCALLs can be divided into three groups:
Group 1: tdh_mem_page_add().
The SEAMCALL is invoked only during TD build time and therefore
KVM has ensured that no contention will occur.
Proposal: (as in patch 1)
Just return an error when TDX_OPERAND_BUSY is found.
Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
These two SEAMCALLs are invoked for page installation.
They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
(due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
tdg_mem_page_attr_rd/wr().
Proposal: (as in patch 1)
- Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
- Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
long as there are no pending signals/interrupts.
The retry inside TDX aims to reduce the count of tdh_vp_enter()
before resolving EPT violations in the local vCPU, thereby
minimizing contentions with other vCPUs. However, it can't
completely eliminate 0-step mitigation, as it exits when there are
pending signals/interrupts and does not (and cannot) remove the
tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
tdh_mem_page_aug
------------------------------------------------------------
SEPT entry tdh_mem_sept_add (Host lock)
tdh_mem_page_aug (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
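Condensed from patch 1 later in this thread, the retry loop inside TDX's
EPT violation handler looks roughly like this (setup of
gpa/exit_qual/local_retry elided):

	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);

		/* Resolved, a non-retry failure, or a shared-memory fault. */
		if (ret != RET_PF_RETRY || !local_retry)
			break;

		/*
		 * Break and keep the original return value; signal and
		 * IRQ handling will be done later in vcpu_run().
		 */
		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
			break;

		cond_resched();
	}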
Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
These three SEAMCALLs are invoked for page uninstallation, with
KVM mmu_lock held for writing.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
TDCS epoch tdh_vp_enter tdh_mem_track
------------------------------------------------------------
SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
tdh_mem_range_block
------------------------------------------------------------
SEPT entry tdh_mem_range_block (Host lock)
tdh_mem_page_remove (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Proposal: (as in patch 2)
- Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
once.
- During the retry, kick off all vCPUs and prevent any vCPU from
entering to avoid potential contentions.
This is because tdh_vp_enter() and TDCALLs are not protected by
KVM mmu_lock, and remove_external_spte() in KVM must not fail.
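Condensed from patch 2 later in this thread, the retry-once pattern looks
roughly like this (shown for tdh_mem_page_remove(); error logging and the
not-RUNNABLE special case are elided, and tdh_mem_range_block()/
tdh_mem_track() follow the same shape):

	err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level,
				  &entry, &level_state);
	if (unlikely(err & TDX_OPERAND_BUSY)) {
		/*
		 * Kick all vCPUs out of the guest and block re-entry so
		 * that no tdh_vp_enter() or guest TDCALL can contend;
		 * the single retry is then expected to succeed.
		 */
		tdx_no_vcpus_enter_start(kvm);
		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level,
					  &entry, &level_state);
		tdx_no_vcpus_enter_stop(kvm);
	}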
SEAMCALL Lock Type Resource
-----------------------------Group 1--------------------------------
tdh_mem_page_add EXCLUSIVE TDR
NO_LOCK TDCS
NO_LOCK SEPT tree
EXCLUSIVE page to add
----------------------------Group 2--------------------------------
tdh_mem_sept_add SHARED TDR
SHARED TDCS
SHARED SEPT tree
HOST,EXCLUSIVE SEPT entry to modify
EXCLUSIVE page to add
tdh_mem_page_aug SHARED TDR
SHARED TDCS
SHARED SEPT tree
HOST,EXCLUSIVE SEPT entry to modify
EXCLUSIVE page to aug
----------------------------Group 3--------------------------------
tdh_mem_range_block SHARED TDR
SHARED TDCS
EXCLUSIVE SEPT tree
HOST,EXCLUSIVE SEPT entry to modify
tdh_mem_track SHARED TDR
SHARED TDCS
EXCLUSIVE TDCS epoch
tdh_mem_page_remove SHARED TDR
SHARED TDCS
SHARED SEPT tree
HOST,EXCLUSIVE SEPT entry to modify
EXCLUSIVE page to remove
[1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
[2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
[3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
[4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com
Yan Zhao (2):
KVM: TDX: Retry in TDX when installing TD private/sept pages
KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/vmx/tdx.c | 102 +++++++++++++++++++++++++-------
3 files changed, 85 insertions(+), 21 deletions(-)
--
2.43.2
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of the v2 "TDX MMU Part 2" series.

We discussed this on the PUCK call. A couple of alternatives were considered:
- Avoiding 0-step. To handle signals kicking to userspace, we could try to
  add code to generate synthetic EPT violations if KVM thinks the 0-step
  mitigation might be active (i.e. the fault was not resolved). The consensus
  was that this would be a continuing battle, and possibly impossible, since
  normal guest behavior can trigger the mitigation.
- Pre-faulting all S-EPT, such that contention with AUG won't happen. The
  discussion was that this would only be a temporary solution as the MMU
  operations get more complicated (huge pages, etc). There are also
  private/shared conversions and memory hotplug already.

So we will proceed with this kick+lock+retry solution. The reasoning is to
optimize for the normal non-contention path, without having an overly
complicated solution for KVM.

In all the branch commotion recently, these patches fell out of our dev
branch, so we just recently integrated them into a 6.13 kvm-coco-queue based
branch. We need to perform some regression tests based on the 6.13 TDP MMU
changes. Assuming no issues, we can post the 6.13 rebase to be included in
kvm-coco-queue, with instructions on which patches to remove from
kvm-coco-queue (i.e. the 16 retries).

We also briefly touched on the TDX module behavior where guest operations can
lock NP PTEs. The kick solution doesn't require changing this functionally,
but it should still be done to help with debugging issues related to KVM's
contention solution.

Thanks all for the discussion!
On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> We discussed this on the PUCK call. A couple of alternatives were considered:
> - Avoiding 0-step. To handle signals kicking to userspace, we could try to
>   add code to generate synthetic EPT violations if KVM thinks the 0-step
>   mitigation might be active (i.e. the fault was not resolved). The
>   consensus was that this would be a continuing battle, and possibly
>   impossible, since normal guest behavior can trigger the mitigation.

Specifically, the TDX Module takes its write-lock if the guest takes EPT
violation exits on the same RIP 6 times, i.e. detects forward progress based
purely on the RIP at entry vs. exit. So a guest that is touching memory in a
loop could trigger zero-step checking even if KVM promptly fixes every EPT
violation.

> We also briefly touched on the TDX module behavior where guest operations
> can lock NP PTEs. The kick solution doesn't require changing this
> functionally, but it should still be done to help with debugging issues
> related to KVM's contention solution.

And so that KVM developers don't have to deal with customer escalations due
to performance issues caused by known flaws in the TDX module.
On Thu, Nov 21, 2024 at 07:51:39PM +0800, Yan Zhao wrote:
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>          These two SEAMCALLs are invoked for page installation.
>          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
>          (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
>          tdg_mem_page_attr_rd/wr().
>
>          Proposal: (as in patch 1)
>          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
>            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>
>          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
>            long as there are no pending signals/interrupts.

Alternatively, we could have tdx_handle_ept_violation() not retry internally
in TDX, and instead keep the 16 retries in tdx_seamcall_sept() for
tdh_mem_sept_add() and tdh_mem_page_aug() only, i.e. only for the SEAMCALLs
in Group 2.
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>          These two SEAMCALLs are invoked for page installation.
>          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
>          (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
>          tdg_mem_page_attr_rd/wr().

We did verify with the TDX module folks that the TDX module could be changed
to not take the SEPT host priority lock for zero entries (which happens
during the guest operations). In that case, I think we shouldn't expect
contention for tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still
need it for tdh_vp_enter() though.

> Proposal: (as in patch 2)
> - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
>   once.
> - During the retry, kick off all vCPUs and prevent any vCPU from
>   entering to avoid potential contentions.
>
>   This is because tdh_vp_enter() and TDCALLs are not protected by
>   KVM mmu_lock, and remove_external_spte() in KVM must not fail.

The solution for group 3 actually doesn't look too bad at all to me, at least
code- and complexity-wise. It's pretty compact, doesn't add any locks, and is
limited to the tdx.c code. Although, I didn't evaluate the implementation
correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.

Were you able to test the fallback path at all? Can we think of any real
situations where it could be burdensome?

One other thing to note, I think, is that the group 3 SEAMCALLs are part of
no-fail operations. The core KVM calling code has no understanding of a
failure there. So in this scheme of not avoiding contention we have to
succeed before returning, whereas groups 1 and 2 can fail and so don't need
the special fallback scheme.
On Tue, Nov 26, 2024 at 08:46:51AM +0800, Edgecombe, Rick P wrote:
> We did verify with the TDX module folks that the TDX module could be changed
> to not take the SEPT host priority lock for zero entries (which happens
> during the guest operations). In that case, I think we shouldn't expect
> contention for tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still
> need it for tdh_vp_enter() though.

Ah, you are right. I previously incorrectly thought the TDX module would
avoid locking free entries for tdg_mem_page_accept() only.

> Were you able to test the fallback path at all? Can we think of any real
> situations where it could be burdensome?

Yes, I did some negative tests to fail block/track/remove to check whether
the tdx_no_vcpus_enter*() helpers work.

Even without those negative tests, it's not rare for tdh_mem_track() to
return busy due to its contention with tdh_vp_enter(). Given that it's
normally not frequent for tdh_mem_range_block() or tdh_mem_page_remove() to
return busy (if we reduce the frequency of zero-step mitigation), and that
tdh_mem_track() will kick off all vCPUs later anyway, I think it's acceptable
to do the tdx_no_vcpus_enter*() stuff in the page removal path.

> One other thing to note, I think, is that the group 3 SEAMCALLs are part of
> no-fail operations. The core KVM calling code has no understanding of a
> failure there. So in this scheme of not avoiding contention we have to
> succeed before returning, whereas groups 1 and 2 can fail and so don't need
> the special fallback scheme.

Yes, exactly.
For tdh_mem_page_add(), just return an error when TDX_OPERAND_BUSY is found.
For tdh_mem_sept_add(), tdh_mem_page_aug(),
- Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
- Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
long as there are no pending signals/interrupts.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/vmx/tdx.c | 53 +++++++++++++++++++++++++++++++++++-------
2 files changed, 45 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd2eaa1dbebb..f16c0e7248eb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6179,7 +6179,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
vcpu->stat.pf_spurious++;
if (r != RET_PF_EMULATE)
- return 1;
+ return r;
emulate:
return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 23e6f25dd837..60d9e9d050ad 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1705,8 +1705,9 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
err = tdh_mem_sept_add(to_kvm_tdx(kvm)->tdr_pa, gpa, tdx_level, hpa, &entry,
&level_state);
- if (unlikely(err == TDX_ERROR_SEPT_BUSY))
- return -EAGAIN;
+ if (unlikely(err & TDX_OPERAND_BUSY))
+ return -EBUSY;
+
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
return -EIO;
@@ -1855,6 +1856,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
gpa_t gpa = tdexit_gpa(vcpu);
unsigned long exit_qual;
+ bool local_retry = false;
+ int ret;
if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
@@ -1873,6 +1876,23 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* due to aliasing a single HPA to multiple GPAs.
*/
exit_qual = EPT_VIOLATION_ACC_WRITE;
+
+ /*
+ * Mapping of private memory may meet RET_PF_RETRY due to
+ * SEAMCALL contentions, e.g.
+ * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU vs
+ * TDH.VP.ENTER with 0-step mitigation on a remote vCPU.
+ * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU vs
+ * TDG.MEM.PAGE.ACCEPT on a remote vCPU.
+ *
+ * Retry internally in TDX to prevent exacerbating the
+ * activation of 0-step mitigation on local vCPU.
+ * However, despite these retries, the 0-step mitigation on the
+ * local vCPU may still be triggered due to:
+ * - Exiting on signals, interrupts.
+ * - KVM_EXIT_MEMORY_FAULT.
+ */
+ local_retry = true;
} else {
exit_qual = tdexit_exit_qual(vcpu);
/*
@@ -1885,7 +1905,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
}
trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
- return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+
+ while (1) {
+ ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+
+ if (ret != RET_PF_RETRY || !local_retry)
+ break;
+
+ /*
+ * Break and keep the orig return value.
+ * Signal & irq handling will be done later in vcpu_run()
+ */
+ if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
+ kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
+ break;
+
+ cond_resched();
+ }
+ return ret;
}
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
@@ -3028,13 +3065,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
}
ret = 0;
- do {
- err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
- pfn_to_hpa(page_to_pfn(page)),
- &entry, &level_state);
- } while (err == TDX_ERROR_SEPT_BUSY);
+ err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, pfn_to_hpa(pfn),
+ pfn_to_hpa(page_to_pfn(page)),
+ &entry, &level_state);
if (err) {
- ret = -EIO;
+ ret = unlikely(err & TDX_OPERAND_BUSY) ? -EBUSY : -EIO;
goto out;
}
--
2.43.2
For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
- Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
- During the retry, kick off all vCPUs and prevent any vCPU from entering
to avoid potential contentions.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
2 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 521c7cf725bc..bb7592110337 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -123,6 +123,8 @@
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
+#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
+ KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 60d9e9d050ad..ed6b41bbcec6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
__mb();
}
+static void tdx_no_vcpus_enter_start(struct kvm *kvm)
+{
+ kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
+}
+
+static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
+}
+
/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
static int __tdx_reclaim_page(hpa_t pa)
{
@@ -1648,15 +1662,20 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
return -EINVAL;
- do {
- /*
- * When zapping private page, write lock is held. So no race
- * condition with other vcpu sept operation. Race only with
- * TDH.VP.ENTER.
- */
+ /*
+ * When zapping private page, write lock is held. So no race
+ * condition with other vcpu sept operation. Race only with
+ * TDH.VP.ENTER.
+ */
+ err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
+ &level_state);
+ if ((err & TDX_OPERAND_BUSY)) {
+ /* After no vCPUs enter, the second retry is expected to succeed */
+ tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &entry,
&level_state);
- } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+ tdx_no_vcpus_enter_stop(kvm);
+ }
if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE &&
err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
@@ -1728,8 +1747,12 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
WARN_ON_ONCE(level != PG_LEVEL_4K);
err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
- if (unlikely(err == TDX_ERROR_SEPT_BUSY))
- return -EAGAIN;
+ if (unlikely(err & TDX_OPERAND_BUSY)) {
+ /* After no vCPUs enter, the second retry is expected to succeed */
+ tdx_no_vcpus_enter_start(kvm);
+ err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
+ tdx_no_vcpus_enter_stop(kvm);
+ }
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
return -EIO;
@@ -1772,9 +1795,13 @@ static void tdx_track(struct kvm *kvm)
lockdep_assert_held_write(&kvm->mmu_lock);
- do {
+ err = tdh_mem_track(kvm_tdx->tdr_pa);
+ if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY) {
+ /* After no vCPUs enter, the second retry is expected to succeed */
+ tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_track(kvm_tdx->tdr_pa);
- } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+ tdx_no_vcpus_enter_stop(kvm);
+ }
if (KVM_BUG_ON(err, kvm))
pr_tdx_error(TDH_MEM_TRACK, err);
--
2.43.2
On Thu, Nov 21, 2024, Yan Zhao wrote:
> For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
>
> - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> - During the retry, kick off all vCPUs and prevent any vCPU from entering
> to avoid potential contentions.
>
> +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> +{
> + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
And with appropriate helpers in common code. I could have sworn I floated this
idea in the past for something else, but apparently not. The only thing I can
find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
Anyways, I don't see any reason to make this an arch-specific request.
On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> On Thu, Nov 21, 2024, Yan Zhao wrote:
> > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> >
> > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > to avoid potential contentions.
> >
> > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > +{
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
>
> I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
the guest's pause state?
> And with appropriate helpers in common code. I could have sworn I floated this
> idea in the past for something else, but apparently not. The only thing I can
Yes, you suggested that I implement it via a request, similar to
KVM_REQ_MCLOCK_INPROGRESS [1].
(I didn't add your suggested-by tag in this patch because it's just an RFC.)
[1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
> find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
> thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
a bit in vcpu->requests to prevent later vCPUs entering.
> Anyways, I don't see any reason to make this an arch specific request.
After making it non-arch specific, we would probably need an atomic counter
for the start/stop requests in the common helpers. So I just made it
TDX-specific to keep it simple in the RFC.
Will do it in a non-arch-specific way if you think it's worthwhile.
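For the record, a minimal sketch of what such common helpers with a usage
counter might look like (KVM_REQ_PAUSE and the pause_count field are
hypothetical, not an existing KVM API):

	/* Hypothetical: an atomic_t pause_count would be added to struct kvm. */
	static void kvm_pause_all_vcpus(struct kvm *kvm)
	{
		atomic_inc(&kvm->pause_count);
		/*
		 * Setting the request is idempotent; every pauser kicks all
		 * vCPUs out of guest mode before returning.
		 */
		kvm_make_all_cpus_request(kvm, KVM_REQ_PAUSE);
	}

	static void kvm_resume_all_vcpus(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu;
		unsigned long i;

		/* Only the last resumer (paired with a pauser) clears the request. */
		if (atomic_dec_and_test(&kvm->pause_count))
			kvm_for_each_vcpu(i, vcpu, kvm)
				kvm_clear_request(KVM_REQ_PAUSE, vcpu);
	}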
On Wed, Dec 18, 2024, Yan Zhao wrote:
> On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> > On Thu, Nov 21, 2024, Yan Zhao wrote:
> > > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > >
> > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > > to avoid potential contentions.
> > >
> > > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > > +{
> > > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> >
> > I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
> KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
> the guest's pause state?
Maybe?
> > And with appropriate helpers in common code. I could have sworn I floated this
> > idea in the past for something else, but apparently not. The only thing I can
> > Yes, you suggested that I implement it via a request, similar to
> > KVM_REQ_MCLOCK_INPROGRESS [1].
> (I didn't add your suggested-by tag in this patch because it's just an RFC).
>
> [1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
>
> > > find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
> > thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
> KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode, it does not set
> a bit in vcpu->requests to prevent later vCPUs entering.
Yeah, I was mostly just talking to myself. :-)
> > Anyways, I don't see any reason to make this an arch specific request.
> After making it non-arch specific, probably we need an atomic counter for the
> start/stop requests in the common helpers. So I just made it TDX-specific to
> keep it simple in the RFC.
Oh, right. I didn't consider the complications with multiple users. Hrm.
Actually, this doesn't need to be a request. KVM_REQ_OUTSIDE_GUEST_MODE will
force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does. E.g.
have the zap side set wait_for_sept_zap before blasting the request to all vCPUs,
and then in tdx_vcpu_run():
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b49dcf32206b..508ad6462e6d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -921,6 +921,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return EXIT_FASTPATH_NONE;
}
+ if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
+ return EXIT_FASTPATH_EXIT_HANDLED;
+
trace_kvm_entry(vcpu, force_immediate_exit);
if (pi_test_on(&tdx->pi_desc)) {
Ooh, but there's a subtle flaw with that approach. Unlike kvm_vcpu_exit_request(),
the above check would obviously happen before entry to the guest, which means that,
in theory, KVM needs to goto cancel_injection to re-request req_immediate_exit and
cancel injection:
if (req_immediate_exit)
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_x86_call(cancel_injection)(vcpu);
But! This is actually an opportunity to harden KVM. Because the TDX module doesn't
guarantee entry, it's already possible for KVM to _think_ it completed entry to
the guest without actually having done so. It just happens to work because KVM
never needs to force an immediate exit for TDX, and can't do direct injection,
i.e. can "safely" skip the cancel_injection path.
So, I think we can and should go with the above suggestion, but also add a WARN on
req_immediate_exit being set, because TDX ignores it entirely.
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b49dcf32206b..e23cd8231144 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -914,6 +914,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
struct vcpu_tdx *tdx = to_tdx(vcpu);
struct vcpu_vt *vt = to_vt(vcpu);
+ /* <comment goes here> */
+ WARN_ON_ONCE(force_immediate_exit);
+
/* TDX exit handle takes care of this error case. */
if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
tdx->vp_enter_ret = TDX_SW_ERROR;
@@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return EXIT_FASTPATH_NONE;
}
+ if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
+ return EXIT_FASTPATH_EXIT_HANDLED;
+
trace_kvm_entry(vcpu, force_immediate_exit);
if (pi_test_on(&tdx->pi_desc)) {
On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> On Wed, Dec 18, 2024, Yan Zhao wrote:
> > On Tue, Dec 17, 2024 at 03:29:03PM -0800, Sean Christopherson wrote:
> > > On Thu, Nov 21, 2024, Yan Zhao wrote:
> > > > For tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(),
> > > >
> > > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only once.
> > > > - During the retry, kick off all vCPUs and prevent any vCPU from entering
> > > > to avoid potential contentions.
> > > >
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > > arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++--------
> > > > 2 files changed, 40 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 521c7cf725bc..bb7592110337 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -123,6 +123,8 @@
> > > > #define KVM_REQ_HV_TLB_FLUSH \
> > > > KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > > #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
> > > > +#define KVM_REQ_NO_VCPU_ENTER_INPROGRESS \
> > > > + KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > >
> > > > #define CR0_RESERVED_BITS \
> > > > (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index 60d9e9d050ad..ed6b41bbcec6 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -311,6 +311,20 @@ static void tdx_clear_page(unsigned long page_pa)
> > > > __mb();
> > > > }
> > > >
> > > > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > > > +{
> > > > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > >
> > > I vote for making this a common request with a more succinct name, e.g. KVM_REQ_PAUSE.
> > KVM_REQ_PAUSE looks good to me. But will the "pause" cause any confusion with
> > the guest's pause state?
>
> Maybe?
>
> > > And with appropriate helpers in common code. I could have sworn I floated this
> > > idea in the past for something else, but apparently not. The only thing I can
> > Yes, you suggested that I implement it via a request, similar to
> > KVM_REQ_MCLOCK_INPROGRESS. [1].
> > (I didn't add your suggested-by tag in this patch because it's just an RFC).
> >
> > [1] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com/
> >
> > > find is an old arm64 version for pausing vCPUs for emulation. Hmm, maybe I was
> > > thinking of KVM_REQ_OUTSIDE_GUEST_MODE?
> > KVM_REQ_OUTSIDE_GUEST_MODE just kicks vCPUs outside guest mode; it does not set
> > a bit in vcpu->requests to prevent vCPUs from entering again later.
>
> Yeah, I was mostly just talking to myself. :-)
>
> > > Anyways, I don't see any reason to make this an arch specific request.
> > After making it non-arch specific, we'd probably need an atomic counter for the
> > start/stop requests in the common helpers. So I just made it TDX-specific to
> > keep it simple in the RFC.
>
> Oh, right. I didn't consider the complications with multiple users. Hrm.
>
> Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
> force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
> have the zap side set wait_for_sept_zap before blasting the request to all vCPUs,
Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
the current one.
> and then in tdx_vcpu_run():
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b49dcf32206b..508ad6462e6d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -921,6 +921,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> return EXIT_FASTPATH_NONE;
> }
>
> + if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap)))
> + return EXIT_FASTPATH_EXIT_HANDLED;
> +
> trace_kvm_entry(vcpu, force_immediate_exit);
>
> if (pi_test_on(&tdx->pi_desc)) {
>
>
> Ooh, but there's a subtle flaw with that approach. Unlike kvm_vcpu_exit_request(),
> the above check would obviously happen before entry to the guest, which means that,
> in theory, KVM needs to goto cancel_injection to re-request req_immediate_exit and
> cancel injection:
>
> if (req_immediate_exit)
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> kvm_x86_call(cancel_injection)(vcpu);
>
> But! This is actually an opportunity to harden KVM.  Because the TDX module doesn't
> guarantee entry, it's already possible for KVM to _think_ it completed entry to
> the guest without actually having done so.  It just happens to work because KVM
> never needs to force an immediate exit for TDX, and can't do direct injection,
> i.e. can "safely" skip the cancel_injection path.
>
> So, I think we can and should go with the above suggestion, but also add a WARN on
> req_immediate_exit being set, because TDX ignores it entirely.
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b49dcf32206b..e23cd8231144 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -914,6 +914,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> struct vcpu_vt *vt = to_vt(vcpu);
>
> + /* <comment goes here> */
> + WARN_ON_ONCE(force_immediate_exit);
Would it be better to split this hardening out as a separate fix to
commit 37d3baf545cd ("KVM: TDX: Implement TDX vcpu enter/exit path")?
It's required no matter which approach is chosen for the SEPT SEAMCALL retry.
> /* TDX exit handle takes care of this error case. */
> if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> tdx->vp_enter_ret = TDX_SW_ERROR;
> @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> return EXIT_FASTPATH_NONE;
> }
>
> + if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> + return EXIT_FASTPATH_EXIT_HANDLED;
> +
> trace_kvm_entry(vcpu, force_immediate_exit);
>
> if (pi_test_on(&tdx->pi_desc)) {
Thanks for this suggestion.
But what's the advantage of this approach of checking wait_for_sept_zap?
Is it to avoid introducing an arch specific request?
On Thu, Dec 19, 2024, Yan Zhao wrote:
> On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> > On Wed, Dec 18, 2024, Yan Zhao wrote:
> > > > Anyways, I don't see any reason to make this an arch specific request.
> > > After making it non-arch specific, we'd probably need an atomic counter for the
> > > start/stop requests in the common helpers. So I just made it TDX-specific to
> > > keep it simple in the RFC.
> >
> > Oh, right. I didn't consider the complications with multiple users. Hrm.
> >
> > Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
> > force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> > EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
> > have the zap side set wait_for_sept_zap before blasting the request to all vCPUs,
> Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
> the current one.
Why can't it be a VM-wide flag? The current vCPU isn't going to do VP.ENTER, is
it? If it is, I've definitely missed something :-)
> > /* TDX exit handle takes care of this error case. */
> > if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> > tdx->vp_enter_ret = TDX_SW_ERROR;
> > @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > return EXIT_FASTPATH_NONE;
> > }
> >
> > + if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> > + return EXIT_FASTPATH_EXIT_HANDLED;
> > +
> > trace_kvm_entry(vcpu, force_immediate_exit);
> >
> > if (pi_test_on(&tdx->pi_desc)) {
> Thanks for this suggestion.
> But what's the advantage of this approach of checking wait_for_sept_zap?
> Is it to avoid introducing an arch specific request?
Yes, and unless I've missed something, "releasing" vCPUs can be done by clearing
a single variable.
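[For comparison, the common-code request Yan described above would need roughly
the following shape to cope with multiple users. This is a hypothetical sketch:
kvm->pause_in_progress and KVM_REQ_PAUSE are assumed names, not existing KVM
symbols:

static void kvm_pause_vcpus_start(struct kvm *kvm)
{
	/* Count overlapping users; each start must be paired with a stop. */
	atomic_inc(&kvm->pause_in_progress);
	kvm_make_all_cpus_request(kvm, KVM_REQ_PAUSE);
}

static void kvm_pause_vcpus_stop(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	/* Only the last user actually releases the vCPUs. */
	if (!atomic_dec_and_test(&kvm->pause_in_progress))
		return;

	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_PAUSE, vcpu);
}

The single-user TDX case avoids all of this bookkeeping, which is the appeal of
the VM-wide flag.]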
On Wed, Dec 18, 2024 at 06:39:29PM -0800, Sean Christopherson wrote:
> On Thu, Dec 19, 2024, Yan Zhao wrote:
> > On Wed, Dec 18, 2024 at 08:10:48AM -0800, Sean Christopherson wrote:
> > > On Wed, Dec 18, 2024, Yan Zhao wrote:
> > > > > Anyways, I don't see any reason to make this an arch specific request.
> > > > After making it non-arch specific, we'd probably need an atomic counter for the
> > > > start/stop requests in the common helpers. So I just made it TDX-specific to
> > > > keep it simple in the RFC.
> > >
> > > Oh, right. I didn't consider the complications with multiple users. Hrm.
> > >
> > > Actually, this doesn't need to be a request.  KVM_REQ_OUTSIDE_GUEST_MODE will
> > > force vCPUs to exit, at which point tdx_vcpu_run() can return immediately with
> > > EXIT_FASTPATH_EXIT_HANDLED, which is all that kvm_vcpu_exit_request() does.  E.g.
> > > have the zap side set wait_for_sept_zap before blasting the request to all vCPUs,
> > Hmm, the wait_for_sept_zap also needs to be set and unset in all vCPUs except
> > the current one.
>
> Why can't it be a VM-wide flag? The current vCPU isn't going to do VP.ENTER, is
> it? If it is, I've definitely missed something :-)
Ah, right. You are not missing anything. I just forgot it can be a VM-wide flag....
Sorry.
>
> > > /* TDX exit handle takes care of this error case. */
> > > if (unlikely(tdx->state != VCPU_TD_STATE_INITIALIZED)) {
> > > tdx->vp_enter_ret = TDX_SW_ERROR;
> > > @@ -921,6 +924,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
> > > return EXIT_FASTPATH_NONE;
> > > }
> > >
> > > + if (unlikely(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))
> > > + return EXIT_FASTPATH_EXIT_HANDLED;
> > > +
> > > trace_kvm_entry(vcpu, force_immediate_exit);
> > >
> > > if (pi_test_on(&tdx->pi_desc)) {
> > Thanks for this suggestion.
> > But what's the advantage of this approach of checking wait_for_sept_zap?
> > Is it to avoid introducing an arch specific request?
>
> Yes, and unless I've missed something, "releasing" vCPUs can be done by clearing
> a single variable.
Right. Will do in this way in the next version.
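[A minimal sketch of how that next version could look, assuming the flag lives
in struct kvm_tdx as wait_for_sept_zap; the final helpers may differ:

static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
	/* Block new entries first, then kick vCPUs currently in the guest. */
	WRITE_ONCE(to_kvm_tdx(kvm)->wait_for_sept_zap, true);
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}

static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
{
	/* "Releasing" vCPUs is just clearing the single VM-wide flag. */
	WRITE_ONCE(to_kvm_tdx(kvm)->wait_for_sept_zap, false);
}

Combined with the early return in tdx_vcpu_run() from Sean's diff above, any
vCPU attempting entry while the flag is set simply bounces with
EXIT_FASTPATH_EXIT_HANDLED until the zap side clears it.]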
On Thu, 2024-11-21 at 19:57 +0800, Yan Zhao wrote:
>
> +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> +{
I wonder if an mmu write lock assert here would be excessive.
> + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> +}
> +
> +static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
> +{
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
> +}
> +
On Tue, Nov 26, 2024 at 08:47:42AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-11-21 at 19:57 +0800, Yan Zhao wrote:
> >
> > +static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> > +{
>
> I wonder if an mmu write lock assert here would be excessive.
Hmm, I didn't add one because all current callers already assert on that.
But asserting here would be safer, given that the function itself doesn't take
a lock or use a counter.
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
> > +}
> > +
> > +static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
> > +{
> > + struct kvm_vcpu *vcpu;
> > + unsigned long i;
> > +
> > + kvm_for_each_vcpu(i, vcpu, kvm)
> > + kvm_clear_request(KVM_REQ_NO_VCPU_ENTER_INPROGRESS, vcpu);
> > +}
> > +
>
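[If the assert were added, the start helper might look like this sketch; the
same would apply to the flag-based variant discussed above:

static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
	/* Encode the locking rule here instead of relying on each caller. */
	lockdep_assert_held_write(&kvm->mmu_lock);

	kvm_make_all_cpus_request(kvm, KVM_REQ_NO_VCPU_ENTER_INPROGRESS);
}]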