Hi,

This is the 3rd revision of Dynamic PAMT, which is a new feature that
reduces memory use of TDX.

On v2 (as well as in PUCK) there was some discussion of the
refcount/locking design tradeoffs for Dynamic PAMT. For this v3, I’ve
basically gone through and tried to make the details around this more
reviewable. The basic solution is the same as v2, with the changes more
about moving code around or splitting implementations/optimizations. I’m
hoping with this v3 we can close on whether that approach is good enough
or not.

I think the patch quality is in ok shape, but it still needs some review.
Maintainers, please feel free to let us go through this v3 for lower level
code issues, but I would appreciate engagement on the overall design.

Another open item is performance testing, beyond the bit about excluding
contention of the global lock.

Lastly, Yan raised some last minute doubts internally about TDX module
locking contention. I’m not sure there is a problem, but we can come to an
agreement as part of the review.

PAMT Background
===============

The TDX module needs to keep data about each physical page it uses. It
requires the kernel to give it memory to use for this purpose, called
PAMT. Internally it wants space for metadata for each page *and* each page
size. That is, if a page is mapped at 2MB in a TD, it doesn’t spread this
tracking across the allocations it uses for 4KB page size usage of the
same physical memory; it is designed to use a separate allocation for
this. So each memory region that the TDX module could use (aka a TDMR) has
three of these PAMT allocations. They are all allocated during the global
TDX initialization, regardless of whether the memory is actually getting
used for a TD. This uses up approximately 0.4% of system memory.

Dynamic PAMT (DPAMT)
====================

Fortunately, only using physical memory for areas of an address space that
are actually in use is a familiar problem in systems engineering, with a
well trodden solution: page tables.
It would be great if TDX could do something like that for PAMT. This is
basically the idea behind Dynamic PAMT. However, there are some design
aspects that could be surprising for anyone expecting “PAMT, but page
tables”. The following describes these differences.

DPAMT Levels
------------

Dynamic PAMT focuses on the page size level that has the biggest PAMT
allocation - 4KB page size. Since the 2MB and 1GB levels are smaller
allocations, they are left as fixed arrays, as in normal PAMT. But the 4KB
page size is not fully dynamic either: the TDX module still requires a
physically contiguous space for tracking each 4KB page in a TDMR. This
space shrinks significantly though, to currently 1 bit per page.

Page Sizes
----------

Like normal PAMT, Dynamic PAMT wants to provide a way for the TDX module
to have separate PAMT tracking for different page sizes. But unlike normal
PAMT, it does not seamlessly switch between the 2MB and 4KB page sizes
without VMM action. It wants the PAMT mapped at the same level that the
underlying TDX memory is using. In practice this means the VMM needs to
update the PAMT depending on whether secure EPT pages are to be mapped at
2MB or 4KB. While demote/promote have internal handling for these updates
(and support passing or returning the PAMT pages involved), PAGE.ADD/AUG
don’t. Instead two new SEAMCALLs are provided for the VMM to configure the
PAMT to the intended page size (i.e. 4KB if the page will be mapped at
4KB): TDH.PHYMEM.PAMT.ADD/REMOVE.

While some operations on TD memory can internally handle the PAMT
adjustments, the opposite is not true. That is, changes in PAMT don’t try
to automatically change private S-EPT page sizes. Instead, an attempt to
remove 4KB size PAMT pages will fail if any pages in the covered range are
in use.

Concurrency
-----------

For every 2MB physical range there could be many 4KB pages used by TDX
(obviously). But each of those only needs one set of PAMT pages added.
So on the first use of the 2MB region the DPAMT needs to be installed, and
after none of the pages in that range are in use, it needs to be freed.
The TDX module actually does track how many pages are using each 2MB range
and gives hints when the refcount reaches zero. But it is only incremented
when use of the 2MB region actually starts. Like:

1. TDH.PHYMEM.PAMT.ADD (backing for 2MB range X, for page Y)
2. TDH.MEM.PAGE.AUG (page Y)        <- Increments refcount for X.
3. TDH.MEM.PAGE.REMOVE (page Y)     <- Decrements refcount for X, gives
                                       hint in return value.
4. TDH.PHYMEM.PAMT.REMOVE (range X) <- Remove, check that each 4KB page
                                       in X is free.

The internal refcount is tricky to use because of the window of time
between TDH.PHYMEM.PAMT.ADD and TDH.MEM.PAGE.AUG. The PAMT.ADD adds the
backing, but doesn’t tell the TDX module that a VMM intends to map it.
Consider the range X that includes pages Y and Z, for an implementation
that tries to use these hints:

  CPU 0                           CPU 1
  TDH.PHYMEM.PAMT.ADD X
    (returns already mapped)
                                  TDH.MEM.PAGE.REMOVE (Y)
                                    (returns refcount 0 hint)
                                  TDH.PHYMEM.PAMT.REMOVE (X)
  TDH.MEM.PAGE.AUG Z (fail)

So the TDX module’s DPAMT refcounts don't track what the VMM intends to do
with the page, only what it has already done. This leaves a window that
needs to be covered.

TDX module locking
------------------

Inside the TDX module there are various locks. The TDX module does not
wait when it encounters contention; instead it returns a BUSY error code.
This leaves the VMM with the option to either loop around the SEAMCALL, or
return an error to the calling code. In some cases in Linux there is not
an option to return an error from the operation making the SEAMCALL. To
avoid a potentially indeterminate number of retries, KVM opts to kick all
the threads associated with the TD out of the TDX module and retry. This
retry operation is fairly easy to do from within KVM. Since PAMT is a
global resource, this lock contention could come from any TD.
For normal PAMT, exclusive locks are only taken at the 4KB page size
granularity. In practice, this means any page that is not shared between
TDs won’t have to worry about contention. However, for DPAMT this changes.
The TDH.PHYMEM.PAMT.ADD/REMOVE calls take a PAMT lock at 2MB granularity.
If two calls try to operate on the same 2MB region at the same time, one
will get the BUSY error code.

Linux Challenges
================

So for Dynamic PAMT, Linux needs to:
1. Allocate a different fixed sized allocation for each TDMR, for the 4KB
   page size (the 1 bit per page bitmap instead of the normal larger
   allocation).
2. Allocate DPAMT for control structures.
3. Allocate DPAMT 4KB page backings for all TD private memory (which is
   currently only 4KB) and S-EPT page tables.

1 is easy: just query the new size when DPAMT is in use and use that
instead of the regular 4KB PAMT size. The bitmap allocation is even passed
in the same field to the TDX module. If you take away the TDX docs' naming
around the bitmap, it’s just a buffer that changes size.

For 2 and 3, there is more to consider. 2 is relatively easy as long as we
want to install PAMT on demand, since these pages come straight from the
page allocator and not guestmemfd. For 3, upstream we currently only have
4KB pages, which means we could ignore a lot of the specifics about
matching page sizes - there is only one. However, TDX huge pages is also
in progress, so we should avoid a design that would need to be redone
immediately.

Installing 4KB DPAMT backings
-----------------------------

Some TDX pages are used for TD control structures. These need new code to
install 4KB DPAMT backings. But the main problem is how to do this for
private memory. Linux needs to add the DPAMT backing for private memory
before the page gets mapped in the TD. Doing so inside the KVM MMU call
paths adds complications around the BUSY error code, as described above.
It would be tempting to install DPAMT pages from a guestmemfd callback.
This could happen outside the KVM MMU locks. But there are three
complications with this.

One is that in the case of 2MB pages, the guest can control the page size.
This means, even if 4KB page sizes are installed automatically, KVM would
have to handle edge cases of PAMT adjustments at runtime anyway. For
example, memslot deletion and re-adding would trigger a zap of huge pages
that are later remapped at 4KB. This could *maybe* be worked around by
some variant of this technique [0].

Another wrinkle is that Vishal@google has expressed a strong interest in
saving PAMT memory at runtime in the case of 2MB TD private memory. He
wants to support a use case where most TD memory is mapped at 2MB, so he
wants to avoid the overhead of a worst case allocation that assumes all
memory will be mapped at 4KB.

Finally, pre-installing DPAMT pages before the fault doesn’t help with
mapping DPAMT pages for the external (S-EPT) page tables that are
allocated for the fault. So some fault time logic is needed. We could
pre-install DPAMT backing for the external page table cache, which would
happen outside of the MMU lock. This would free us from having to update
DPAMT inside the MMU lock, but it would not free KVM from having to do
anything around DPAMT during a fault.

These three non-show-stopping issues tilt things towards using a fault
time DPAMT installation approach for private memory.

Avoiding duplicate attempts to add/remove DPAMT
-----------------------------------------------

As covered above, there isn’t a refcount in the TDX module that we can
use. Using the hints returned by TDH.MEM.PAGE.AUG/REMOVE was tried in v1,
and Kirill was unable to both overcome the races and make nice with the
new failure scenarios around DPAMT locks. So in v2 refcounts were
allocated on the kernel side for each 2MB range covered by a TDMR (a range
the TDX module might use).
This adds a small memory overhead of 0.0002%, which is small compared to
reducing the 0.4% PAMT overhead to 0.004%. The major downside is code
complexity. These allocations are still large and involve managing vmalloc
space. The v2 solution reserves vmalloc space to cover the entire physical
address space, and only maps pages for ranges that are covered by a TDMR.

Avoiding contention
-------------------

As covered above, Linux needs to deal with getting BUSY error codes here.
The allocation/mapping paths for all these pages can already handle
failure, but the trickier case is the removal paths. As previously sparred
with during the base series, the TDX module expects to be able to fail
these calls, but the kernel does not. Further, the refcounts can not act
as a race free lock on their own. So some synchronization is needed before
actually removing the DPAMT backings.

V2 of the series includes a global lock to be used around actual
installation/removal of the DPAMT backing, combined with opportunistic
checking outside the lock to avoid taking it most of the time. In testing,
booting 10 16GB TDs, the lock only hit contention 1136 times, with 4ms of
waiting. This is very small for an operation that took 60s of wall time.
So despite being an (ugly) global lock, the actual impact was small. It
will probably be further reduced in the case of huge pages, where most of
the time 4KB DPAMT installation will not be necessary.

Updates in v3
=============

Besides incorporating the feedback and general cleanups, the major design
change was around how DPAMT backing pages are allocated in the fault path.

Allocating DPAMT pages in v2
----------------------------

In v2, there was a specific page cache added in the generic x86 KVM MMU
for DPAMT pages. This was needed because much of the fault handling
happens inside a spinlock. By the time the fault handler knows whether it
needs to install DPAMT backing, it can no longer allocate pages.
This is a common pattern in the KVM MMU, and so pages are pre-allocated
before taking the MMU spinlock. This way they can be used later if needed.
But KVM’s page cache infrastructure is not easy to pass into the arch/x86
code and inside the global spinlock where the pages would be consumed. So
instead it passed a pointer to a function inside KVM that could be called
to allocate pages from the KVM page cache component. Since the control
structures need DPAMT backing installed outside of the fault, the arch/x86
code also had logic to allocate pages directly from the page allocator.
Further, there were various resulting intermediate lists that had to be
marshaled through the DPAMT allocation paths. This was all a bit
complicated to me.

Updates in v3
-------------

V3 redid the areas described above to try to simplify them. The KVM MMU
already has knowledge that TDX needs special memory allocated for S-EPT.
From the fault handler's perspective, this could be seen as just more
memory of the same type. So v3 just turns the external page table
allocation into an x86 op, and provides another op to allocate from it.
This is done as refactoring. Then when Dynamic PAMT is added, the extra
pages can just be added from within the x86 op in TDX code.

To remove the function pointer callback scheme, the external page tables
are switched to a dirt simple linked list based page cache. This is
somewhat reinventing the wheel, but KVM’s kvm_mmu_memory_cache operations
are not easy to expose to the core kernel, and TDX doesn’t need much of
the fanciness of initial values and things like that. Building it out of
the kernel's linked list is enough code reuse.

Today the TDX module needs 2 pages per 2MB region of 4KB size Dynamic PAMT
backing. It would be tempting to just pass two pages in, but the TDX
module exposes the number of Dynamic PAMT pages it needs as a metadata
value. So the size is technically variable.
To handle this, the design just passes in the simple TDX page cache list
in the calls that might need to allocate Dynamic PAMT.

Considerations for v4
=====================

This solution seems workable. It isolates Dynamic PAMT to TDX code, and
doesn’t introduce any extra constraints on generic x86 KVM code. But the
refcounts and global lock on the arch/x86 side of TDX are still ugly.
There has been some internal discussion about pursuing various schemes to
avoid this. But before a potential redesign, I wanted to share the current
version, both to get feedback on the updates, and so we can consider how
“good enough” the current design is.

Testing and branch
==================

Testing is a bit light currently: just TDX selftests and a simple TDX
Linux guest boot. The branch is here:
https://github.com/intel/tdx/commits/dpamt_v3/
Based on kvm_x86/next (603c090664d3)

[0] https://lore.kernel.org/kvm/20250807094423.4644-1-yan.y.zhao@intel.com/

Kirill A. Shutemov (13):
  x86/tdx: Move all TDX error defines into <asm/shared/tdx_errno.h>
  x86/tdx: Add helpers to check return status codes
  x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
  x86/virt/tdx: Allocate reference counters for PAMT memory
  x86/virt/tdx: Improve PAMT refcounters allocation for sparse memory
  x86/virt/tdx: Add tdx_alloc/free_page() helpers
  x86/virt/tdx: Optimize tdx_alloc/free_page() helpers
  KVM: TDX: Allocate PAMT memory for TD control structures
  KVM: TDX: Allocate PAMT memory for vCPU control structures
  KVM: TDX: Handle PAMT allocation in fault path
  KVM: TDX: Reclaim PAMT memory
  x86/virt/tdx: Enable Dynamic PAMT
  Documentation/x86: Add documentation for TDX's Dynamic PAMT

Rick Edgecombe (3):
  x86/virt/tdx: Simplify tdmr_get_pamt_sz()
  KVM: TDX: Add x86 ops for external spt cache
  x86/virt/tdx: Add helpers to allow for pre-allocating pages

 Documentation/arch/x86/tdx.rst              |  21 +
 arch/x86/coco/tdx/tdx.c                     |   6 +-
 arch/x86/include/asm/kvm-x86-ops.h          |   2 +
 arch/x86/include/asm/kvm_host.h             |  11 +-
 arch/x86/include/asm/shared/tdx.h           |   1 +
 arch/x86/include/asm/shared/tdx_errno.h     | 109 +++++
 arch/x86/include/asm/tdx.h                  |  76 ++-
 arch/x86/include/asm/tdx_global_metadata.h  |   1 +
 arch/x86/kvm/mmu/mmu.c                      |   4 +-
 arch/x86/kvm/mmu/mmu_internal.h             |   2 +-
 arch/x86/kvm/vmx/tdx.c                      | 157 ++++--
 arch/x86/kvm/vmx/tdx.h                      |   3 +-
 arch/x86/kvm/vmx/tdx_errno.h                |  40 --
 arch/x86/virt/vmx/tdx/tdx.c                 | 505 +++++++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h                 |   5 +-
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c |   3 +
 16 files changed, 766 insertions(+), 180 deletions(-)
 create mode 100644 arch/x86/include/asm/shared/tdx_errno.h
 delete mode 100644 arch/x86/kvm/vmx/tdx_errno.h

--
2.51.0
On Thu, Sep 18, 2025 at 04:22:08PM -0700, Rick Edgecombe wrote:
> Hi,
>
> This is 3rd revision of Dynamic PAMT, which is a new feature that reduces
> memory use of TDX.
>
> On v2 (as well as in PUCK) there was some discussion of the
> refcount/locking design tradeoffs for Dynamic PAMT. In v2, I’ve basically
> gone through and tried to make the details around this more reviewable.
> The basic solution is the same as v2, with the changes more about moving
> code around or splitting implementations/optimizations. I’m hoping with
> this v3 we can close on whether that approach is good enough or not.
>
> I think the patch quality is in ok shape, but still need some review.
> Maintainers, please feel free to let us go through this v3 for lower level
> code issues, but I would appreciate engagement on the overall design.
>
> Another open still is performance testing, besides the bit about excluding
> contention of the global lock.
>
> Lastly, Yan raised some last minute doubts internally about TDX module
> locking contention. I’m not sure there is a problem, but we can come to an
> agreement as part of the review.

Yes, I found a contention issue that prevents us from dropping the global
lock. I've also written a sample test that demonstrates this contention.

  CPU 0                                 CPU 1
  (a) TDH.PHYMEM.PAMT.ADD(A1, B1, xx1)  (b) TDH.PHYMEM.PAMT.ADD(B2, xx2, xx3)

A1, B2 are not from the same 2MB physical range.
B1, B2 are from the same 2MB physical range.
Physical addresses of xx1, xx2, xx3 are irrelevant.

(a) adds PAMT pages B1, xx1 for A1's 2MB physical range.
(b) adds PAMT pages xx2, xx3 for B2's 2MB physical range.

This could happen when the host wants to AUG a 4KB page A1. So, it
allocates 4KB pages B1, xx1, and invokes (a). Then the host wants to AUG a
4KB page B2. Since there are no PAMT pages for B2's 2MB physical range
yet, the host invokes (b).

(a) needs to hold the shared lock of B1's 2MB PAMT entry.
(b) needs to hold the exclusive lock of B2's 2MB PAMT entry.
On 9/25/25 19:28, Yan Zhao wrote:
>> Lastly, Yan raised some last minute doubts internally about TDX
>> module locking contention. I’m not sure there is a problem, but we
>> can come to an agreement as part of the review.
> Yes, I found a contention issue that prevents us from dropping the
> global lock. I've also written a sample test that demonstrates this
> contention.

But what is the end result when this contention happens? Does everyone
livelock?
On Fri, 2025-09-26 at 07:09 -0700, Dave Hansen wrote:
> On 9/25/25 19:28, Yan Zhao wrote:
> > > Lastly, Yan raised some last minute doubts internally about TDX
> > > module locking contention. I’m not sure there is a problem, but we
> > > can come to an agreement as part of the review.
> > Yes, I found a contention issue that prevents us from dropping the
> > global lock. I've also written a sample test that demonstrates this
> > contention.
> But what is the end result when this contention happens? Does
> everyone livelock?

You get a TDX_OPERAND_BUSY error code returned. Inside the TDX module each
lock is a trylock. The TDX module tries to take a sequence of locks, and
if it meets any contention it will release them all and return
TDX_OPERAND_BUSY. Some paths in KVM cannot handle failure and so don't
have a way to handle the error.

So another option for handling this is just to retry until you succeed.
Then you have a very strange spinlock with a heavyweight inner loop. And
since the locks are all released each time, some astronomically bad timing
might prevent forward progress. On the KVM side we have avoided looping.
Although, I think we have not exhausted this path.
On 9/26/25 09:02, Edgecombe, Rick P wrote:
> On Fri, 2025-09-26 at 07:09 -0700, Dave Hansen wrote:
>> On 9/25/25 19:28, Yan Zhao wrote:
>>>> Lastly, Yan raised some last minute doubts internally about TDX
>>>> module locking contention. I’m not sure there is a problem, but we
>>>> can come to an agreement as part of the review.
>>> Yes, I found a contention issue that prevents us from dropping the
>>> global lock. I've also written a sample test that demonstrates this
>>> contention.
>> But what is the end result when this contention happens? Does
>> everyone livelock?
>
> You get a TDX_OPERAND_BUSY error code returned. Inside the TDX module
> each lock is a try lock. The TDX module tries to take a sequence of
> locks and if it meets any contention it will release them all and
> return the TDX_OPERAND_BUSY. Some paths in KVM cannot handle failure
> and so don't have a way to handle the error.
>
> So another option to handling this is just to retry until you succeed.
> Then you have a very strange spinlock with a heavyweight inner loop.
> But since each time the locks are released, some astronomical bad
> timing might prevent forward progress. On the KVM side we have avoided
> looping. Although, I think we have not exhausted this path.

If it can't return failure then the _only_ other option is to spin. Right?

I understand the reluctance to have such a nasty spin loop. But other than
reworking the KVM code to do the retries at a higher level, is there
another option?
On Fri, 2025-09-26 at 09:11 -0700, Dave Hansen wrote:
> If it can't return failure then the _only_ other option is to spin.
> Right?

Yea, but you could spin around the SEAMCALL or you could spin on duplicate
locks on the kernel side before making the SEAMCALL. Or put more
generally, you could prevent contention before you make the SEAMCALL. KVM
does this also by kicking vCPUs out of the TDX module via IPI in other
cases.

> I understand the reluctance to have such a nasty spin loop. But other
> than reworking the KVM code to do the retries at a higher level,

Re-working KVM code would be tough, although teaching KVM to fail zap
calls has come up before for TDX/gmem interactions. It was looked at and
decided to be too complex. Now I guess the benefit side of the equation
changes a little bit, but doing it only for TDX might still be a bridge
too far.

Unless anyone is holding onto another usage that might want this?

> is there another option?

I don't see why we can't just duplicate the locking in a more matching way
on the kernel side. Before, the plan to someday drop the global lock if
needed was to switch to 2MB granular locks to match the TDX module's
internal exclusive lock behavior.

What Yan is basically pointing out is that there are shared locks that are
also taken on different ranges that could possibly contend with the
exclusive one that we are duplicating on the kernel side.

So the problem is not fundamental to the approach, I think. We just took a
shortcut by ignoring the shared locks. For line-of-sight to a path to
remove the global lock someday, I think we could make the 2MB granular
locks be reader/writer to match the TDX module. Then around the SEAMCALLs
that take these locks, we could take them on the kernel side in the right
order for whichever SEAMCALL we are making.

And that would only be the plan if we wanted to improve scalability
someday without changing the TDX module.
On Sat, Sep 27, 2025 at 03:00:31AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-09-26 at 09:11 -0700, Dave Hansen wrote:
> > If it can't return failure then the _only_ other option is to spin.
> > Right?
>
> Yea, but you could spin around the SEAMCALL or you could spin on
> duplicate locks on the kernel side before making the SEAMCALL. Or put
> more generally, you could prevent contention before you make the
> SEACMALL. KVM does this also by kicking vCPUs out of the TDX module via
> IPI in other cases.
>
> > I understand the reluctance to have such a nasty spin loop. But other
> > than reworking the KVM code to do the retries at a higher level,
>
> Re-working KVM code would be tough, although teaching KVM to fail zap
> calls has come up before for TDX/gmem interactions. It was looked at
> and decided to be too complex. Now I guess the benefit side of the
> equation changes a little bit, but doing it only for TDX might still be
> a bridge to far.
>
> Unless anyone is holding onto another usage that might want this?
>
> > is there another option?
>
> I don't see why we can't just duplicate the locking in a more matching
> way on the kernel side. Before the plan to someday drop the global lock
> if needed, was to switch to 2MB granular locks to match the TDX
> module's exclusive lock internal behavior.
>
> What Yan is basically pointing out is that there are shared locks that
> are also taken on different ranges that could possibly contend with the
> exclusive one that we are duplicating on the kernel side.
>
> So the problem is not fundamental to the approach I think. We just took
> a shortcut by ignoring the shared locks. For line-of-sight to a path to
> remove the global lock someday, I think we could make the 2MB granular
> locks be reader/writer to match the TDX module.

Not sure if that would work.
In the following scenario, where
(a) adds PAMT pages B1, xx1 for A1's 2MB physical range.
(b) adds PAMT pages A2, xx2 for B2's 2MB physical range.

A1, B2 are not from the same 2MB physical range,
A1, A2 are from the same 2MB physical range.
B1, B2 are from the same 2MB physical range.
Physical addresses of xx1, xx2 are irrelevant.

CPU 0                                  CPU 1
---------------------------------      -----------------------------
write_lock(&rwlock-of-range-A1);       write_lock(&rwlock-of-range-B2);
read_lock(&rwlock-of-range-B1);        read_lock(&rwlock-of-range-A2);
...                                    ...
(a) TDH.PHYMEM.PAMT.ADD(A1, B1, xx1)   (b) TDH.PHYMEM.PAMT.ADD(B2, A2, xx2)
...                                    ...
read_unlock(&rwlock-of-range-B1);      read_unlock(&rwlock-of-range-A2);
write_unlock(&rwlock-of-range-A1);     write_unlock(&rwlock-of-range-B2);

To match the reader/writer locks in the TDX module, it looks like we may
encounter an AB-BA lock issue.

Do you have any suggestions for a better approach?

e.g., could the PAMT pages be allocated from a dedicated pool that ensures
they reside in different 2MB ranges from guest private pages and TD
control pages?

> And that would only be the plan if we wanted to improve scalability
> someday without changing the TDX module.
On Sun, Sep 28, 2025 at 09:34:14AM +0800, Yan Zhao wrote:
> On Sat, Sep 27, 2025 at 03:00:31AM +0800, Edgecombe, Rick P wrote:
> > On Fri, 2025-09-26 at 09:11 -0700, Dave Hansen wrote:
> > > If it can't return failure then the _only_ other option is to spin.
> > > Right?
> >
> > Yea, but you could spin around the SEAMCALL or you could spin on
> > duplicate locks on the kernel side before making the SEAMCALL. Or put
> > more generally, you could prevent contention before you make the
> > SEACMALL. KVM does this also by kicking vCPUs out of the TDX module
> > via IPI in other cases.
> >
> > > I understand the reluctance to have such a nasty spin loop. But
> > > other than reworking the KVM code to do the retries at a higher
> > > level,
> >
> > Re-working KVM code would be tough, although teaching KVM to fail zap
> > calls has come up before for TDX/gmem interactions. It was looked at
> > and decided to be too complex. Now I guess the benefit side of the
> > equation changes a little bit, but doing it only for TDX might still
> > be a bridge to far.
> >
> > Unless anyone is holding onto another usage that might want this?
> >
> > > is there another option?
> >
> > I don't see why we can't just duplicate the locking in a more
> > matching way on the kernel side. Before the plan to someday drop the
> > global lock if needed, was to switch to 2MB granular locks to match
> > the TDX module's exclusive lock internal behavior.
> >
> > What Yan is basically pointing out is that there are shared locks
> > that are also taken on different ranges that could possibly contend
> > with the exclusive one that we are duplicating on the kernel side.
> >
> > So the problem is not fundamental to the approach I think. We just
> > took a shortcut by ignoring the shared locks. For line-of-sight to a
> > path to remove the global lock someday, I think we could make the 2MB
> > granular locks be reader/writer to match the TDX module.
> > Then around the SEAMCALLs that take these locks, we could take them
> > on the kernel side in the right order for whichever SEAMCALL we are
> > making.
> Not sure if that would work.
>
> In the following scenario, where
> (a) adds PAMT pages B1, xx1 for A1's 2MB physical range.
> (b) adds PAMT pages A2, xx2 for B2's 2MB physical range.
>
> A1, B2 are not from the same 2MB physical range,
> A1, A2 are from the same 2MB physical range.
> B1, B2 are from the same 2MB physical range.
> Physical addresses of xx1, xx2 are irrelevant.
>
> CPU 0                                  CPU 1
> ---------------------------------      -----------------------------
> write_lock(&rwlock-of-range-A1);       write_lock(&rwlock-of-range-B2);
> read_lock(&rwlock-of-range-B1);        read_lock(&rwlock-of-range-A2);
> ...                                    ...
> (a) TDH.PHYMEM.PAMT.ADD(A1, B1, xx1)   (b) TDH.PHYMEM.PAMT.ADD(B2, A2, xx2)
> ...                                    ...
> read_unlock(&rwlock-of-range-B1);      read_unlock(&rwlock-of-range-A2);
> write_unlock(&rwlock-of-range-A1);     write_unlock(&rwlock-of-range-B2);
>
> To match the reader/writer locks in the TDX module, it looks like we may
> encounter an AB-BA lock issue.
>
> Do you have any suggestions for a better approach?
>
> e.g., could the PAMT pages be allocated from a dedicated pool that
> ensures they reside in different 2MB ranges from guest private pages and
> TD control pages?

It can work: allocate 2MB at a time for PAMT and piecemeal it to the TDX
module as needed. But it means that if the 2MB allocation fails, TDX is
not functional.

Maybe just use a dedicated kmem_cache for PAMT allocations. Although, I am
not sure if there's a way to specify to kmem_cache what pages to ask for
from the page allocator.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 9/29/25 04:17, Kiryl Shutsemau wrote:
>> Do you have any suggestions for a better approach?
>>
>> e.g., could the PAMT pages be allocated from a dedicated pool that
>> ensures they reside in different 2MB ranges from guest private pages
>> and TD control pages?
> It can work: allocate 2M a time for PAMT and piecemeal it to TDX module
> as needed. But it means if 2M allocation is failed, TDX is not
> functional.
>
> Maybe just use a dedicated kmem_cache for PAMT allocations. Although, I
> am not sure if there's a way to specify to kmem_cache what pages to ask
> from page allocator.

That seems a bit obtuse rather than just respecting normal lock ordering
rules. No?
On Mon, 2025-09-29 at 09:22 -0700, Dave Hansen wrote:
> On 9/29/25 04:17, Kiryl Shutsemau wrote:
> > > Do you have any suggestions for a better approach?
> > >
> > > e.g., could the PAMT pages be allocated from a dedicated pool that
> > > ensures they reside in different 2MB ranges from guest private
> > > pages and TD control pages?

I guess the TDX module can get away with doing them in whatever random
order because they are try-locks. Perhaps we could take them in PFN order?
If need be, the TDX module could do the same.

> > It can work: allocate 2M a time for PAMT and piecemeal it to TDX
> > module as needed. But it means if 2M allocation is failed, TDX is not
> > functional.
> >
> > Maybe just use a dedicated kmem_cache for PAMT allocations. Although,
> > I am not sure if there's a way to specify to kmem_cache what pages to
> > ask from page allocator.
>
> That seems a bit obtuse rather than just respecting normal lock
> ordering rules. No?

It might serve the purpose of proving that scalability is possible. I was
thinking that if we had line of sight to improving it, we could go with
the "simple" solution and wait until there is a problem. Is that
reasonable?

But locking issues and state duplication between the TDX module and the
kernel are a recurring pattern. The similar KVM MMU issue was probably the
most contentious design issue we had in the TDX base enabling, together
with the TD configuration API. To me, DPAMT is just more proof that this
is not a rare conflict, but a fundamental problem.

If we do need to find a way to improve DPAMT scalability, I want to say we
should actually think about the generic case and try to solve that. But
that needs much more thought...
On Mon, 2025-09-29 at 09:58 -0700, Rick Edgecombe wrote:
> It might serve a purpose of proving that scalability is possible. I was
> thinking if we had line of sight to improving it we could go with the
> "simple" solution and wait until there is a problem. Is it reasonable?

We discussed this offline and the conclusion was to just keep a global
reader/writer lock as the back-pocket solution. So DPAMT SEAMCALLs take
the global lock for read, and if they get a BUSY code, they take the
global lock for write and retry. This is similar to what KVM does on a
per-VM scope, but global, and reduced by the refcount. But until there is
a measured issue, just keep the simpler global exclusive lock with the
refcount to start.

So unless anyone new chimes in, we can call this closed.
On 9/26/25 12:00, Edgecombe, Rick P wrote:
> So the problem is not fundamental to the approach I think. We just took
> a shortcut by ignoring the shared locks. For line-of-sight to a path to
> remove the global lock someday, I think we could make the 2MB granular
> locks be reader/writer to match the TDX module. Then around the
> SEAMCALLs that take these locks, we could take them on the kernel side
> in the right order for whichever SEAMCALL we are making.

Yeah, but what if the TDX module changes its locking scheme in 5 years or
10? What happens to 6.17.9999-stable?
On Fri, 2025-09-26 at 12:03 -0700, Dave Hansen wrote:
> Yeah, but what if the TDX module changes its locking scheme in 5 years
> or 10? What happens to 6.17.9999-stable?

Yea, we expected we were turning the locking behavior into ABI even in the
existing S-EPT management code. To change it would require a host opt-in.

We originally assured correctness via TDX module source code inspection,
but recently the TDX module folks pointed out that the TDX module ABI docs
actually list information about what locks are taken for each SEAMCALL.
Or, in its terms, which resources are held in which context. It is in the
tables labeled "Concurrency Restrictions". Lock ordering (like we would
need for the above) is not listed though.