This patchset enables Dynamic PAMT in TDX. Please review.

Previously, we thought it could only be upstreamed after huge page support, but
huge pages require support on the guestmemfd side which might take time to hit
upstream. Dynamic PAMT doesn't have such dependencies.

The patchset can be found here:

  git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt

==========================================================================

The Physical Address Metadata Table (PAMT) holds TDX metadata for physical
memory and must be allocated by the kernel during TDX module initialization.

The exact size of the required PAMT memory is determined by the TDX module and
may vary between TDX module versions, but currently it is approximately 0.4%
of the system memory. This is a significant commitment, especially if it is
not known upfront whether the machine will run any TDX guests.

The Dynamic PAMT feature reduces static PAMT allocations. The PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.

PAMT memory is dynamically allocated as pages gain TDX protections. It is
reclaimed when TDX protections have been removed from all pages in a
contiguous area.

Dynamic PAMT support in TDX module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of TDX
module initialization. At the init stage, the PAMT_4K allocation is replaced
with a PAMT_PAGE_BITMAP allocation, which currently requires one bit of memory
per 4K page.

The VMM is responsible for allocating and freeing PAMT_4K. There are a couple
of new SEAMCALLs for this: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE.
They add/remove PAMT memory in the form of a page pair. There is no
requirement for these pages to be contiguous.

A page pair supplied via TDH.PHYMEM.PAMT.ADD covers the specified 2M region,
making any 4K page from that region usable by the TDX module.

With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):

  - TDH.MNG.CREATE
  - TDH.MNG.ADDCX
  - TDH.VP.ADDCX
  - TDH.VP.CREATE
  - TDH.MEM.PAGE.ADD
  - TDH.MEM.PAGE.AUG
  - TDH.MEM.PAGE.DEMOTE
  - TDH.MEM.PAGE.RELOCATE

Basically, if you supply memory to a TD, this memory has to be backed by PAMT
memory.

Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.

The TDX module tracks PAMT memory usage and can give the VMM a hint that PAMT
memory can be removed. Such a hint is provided by all SEAMCALLs that remove
memory from a TD:

  - TDH.MEM.SEPT.REMOVE
  - TDH.MEM.PAGE.REMOVE
  - TDH.MEM.PAGE.PROMOTE
  - TDH.MEM.PAGE.RELOCATE
  - TDH.PHYMEM.PAGE.RECLAIM

With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as additional
input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE returns the
no-longer-needed PAMT page pair.

PAMT memory is a global resource and is not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical address.
Each entry in the tree can be locked with a shared or exclusive lock. Any
modification of the tree requires an exclusive lock.

Any SEAMCALL that takes an explicit HPA as an argument will walk the tree,
taking shared locks on entries. This is required to make sure that the page
pointed to by the HPA is of a compatible type for the usage.

TDCALLs don't take PAMT locks as none of them take an HPA as an argument.
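To make the "one page pair per 2M region" bookkeeping concrete, here is a
minimal sketch of the address math involved. The names below
(TDX_PAMT_PAGE_PAIR, pamt_2m_index(), the wrapper prototype) are illustrative
only, not the identifiers used in this series:

  #include <linux/sizes.h>
  #include <linux/types.h>

  struct page;

  /* TDH.PHYMEM.PAMT.ADD/REMOVE operate on a pair of 4K pages. */
  #define TDX_PAMT_PAGE_PAIR	2

  /* Index of the 2M region an HPA falls into; one page pair per region. */
  static inline unsigned long pamt_2m_index(phys_addr_t hpa)
  {
  	return hpa / SZ_2M;
  }

  /*
   * Illustrative wrapper: hand @pamt_pages to the TDX module so that any
   * 4K page in the 2M region containing @hpa becomes usable for TDX.
   */
  int pamt_add_2m_region(phys_addr_t hpa,
  			struct page *pamt_pages[TDX_PAMT_PAGE_PAIR]);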
Dynamic PAMT enabling in kernel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel maintains refcounts for every 2M region with two helpers:
tdx_pamt_get() and tdx_pamt_put().

The refcount represents the number of users of the PAMT memory in the region.
The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.

The function tdx_alloc_page() allocates a new page and ensures that it is
backed by PAMT memory. Pages allocated in this manner are ready to be used for
a TD. The function tdx_free_page() frees the page and releases the PAMT memory
for the 2M region if it is no longer needed.

PAMT memory gets allocated as part of TD init, VCPU init, on populating the
SEPT tree and on adding guest memory (both during TD build and via AUG on
accept). Splitting a 2M page into 4K pages also requires PAMT memory.

PAMT memory is removed on reclaim of control pages and guest memory.

Populating PAMT memory on fault and on split is tricky as the kernel cannot
allocate memory from the context where it is needed. These code paths use
pre-allocated PAMT memory pools.

Previous attempt on Dynamic PAMT enabling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The initial attempt at kernel enabling was quite different. It was built
around lazy PAMT allocation: only trying to add a PAMT page pair if a SEAMCALL
fails due to a missing PAMT, and reclaiming it based on hints provided by the
TDX module. The motivation was to avoid duplicating on the kernel side the
PAMT memory refcounting that the TDX module already does.

This approach is inherently more racy as there is no serialization of PAMT
memory add/remove against SEAMCALLs that add/remove memory for a TD. Such
serialization would require global locking, which is not feasible.

This approach worked, but at some point it became clear that it could not be
robust as long as the kernel avoids TDX_OPERAND_BUSY loops. TDX_OPERAND_BUSY
will occur as a result of the races mentioned above.

This approach was abandoned in favor of explicit refcounting.

v2:
 - Drop phys_prepare/cleanup. Use kvm_get_running_vcpu() to reach the
   per-VCPU PAMT memory pool from TDX code instead.
 - Move code that allocates/frees PAMT out of KVM;
 - Allocate refcounts per-memblock, not per-TDMR;
 - Fix free_pamt_metadata() for machines without Dynamic PAMT;
 - Fix refcounting in tdx_pamt_put() error path;
 - Export functions where they are used;
 - Consolidate TDX error handling code;
 - Add documentation for Dynamic PAMT;
 - Mark /proc/meminfo patch [NOT-FOR-UPSTREAM];
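For reference, here is a rough sketch (not the actual patch) of how the
tdx_alloc_page()/tdx_free_page() helpers described in the "Dynamic PAMT
enabling in kernel" section above pair a page allocation with PAMT backing for
its 2M region. The tdx_pamt_get()/tdx_pamt_put() signatures are assumptions
for the purpose of the example; error codes and the locking that serializes
the SEAMCALLs are omitted:

  #include <linux/gfp.h>
  #include <linux/mm.h>

  /* Assumed signatures for the per-2M-region refcount helpers. */
  int tdx_pamt_get(struct page *page);
  void tdx_pamt_put(struct page *page);

  struct page *tdx_alloc_page(void)
  {
  	struct page *page = alloc_page(GFP_KERNEL);

  	if (!page)
  		return NULL;

  	/* 0->1 on the 2M region triggers TDH.PHYMEM.PAMT.ADD. */
  	if (tdx_pamt_get(page)) {
  		__free_page(page);
  		return NULL;
  	}

  	return page;
  }

  void tdx_free_page(struct page *page)
  {
  	if (!page)
  		return;

  	/* 1->0 on the 2M region triggers TDH.PHYMEM.PAMT.REMOVE. */
  	tdx_pamt_put(page);
  	__free_page(page);
  }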
Kirill A. Shutemov (12):
  x86/tdx: Consolidate TDX error handling
  x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
  x86/virt/tdx: Allocate reference counters for PAMT memory
  x86/virt/tdx: Add tdx_alloc/free_page() helpers
  KVM: TDX: Allocate PAMT memory in __tdx_td_init()
  KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
  KVM: TDX: Preallocate PAMT pages to be used in page fault path
  KVM: TDX: Handle PAMT allocation in fault path
  KVM: TDX: Reclaim PAMT memory
  [NOT-FOR-UPSTREAM] x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo
  x86/virt/tdx: Enable Dynamic PAMT
  Documentation/x86: Add documentation for TDX's Dynamic PAMT

 Documentation/arch/x86/tdx.rst              | 108 ++++++
 arch/x86/coco/tdx/tdx.c                     |   6 +-
 arch/x86/include/asm/kvm_host.h             |   2 +
 arch/x86/include/asm/set_memory.h           |   3 +
 arch/x86/include/asm/tdx.h                  |  40 ++-
 arch/x86/include/asm/tdx_errno.h            |  96 +++++
 arch/x86/include/asm/tdx_global_metadata.h  |   1 +
 arch/x86/kvm/mmu/mmu.c                      |   7 +
 arch/x86/kvm/vmx/tdx.c                      | 102 ++++--
 arch/x86/kvm/vmx/tdx.h                      |   1 -
 arch/x86/kvm/vmx/tdx_errno.h                |  40 ---
 arch/x86/mm/Makefile                        |   2 +
 arch/x86/mm/meminfo.c                       |  11 +
 arch/x86/mm/pat/set_memory.c                |   2 +-
 arch/x86/virt/vmx/tdx/tdx.c                 | 380 +++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h                 |   5 +-
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c |   3 +
 virt/kvm/kvm_main.c                         |   1 +
 18 files changed, 702 insertions(+), 108 deletions(-)
 create mode 100644 arch/x86/include/asm/tdx_errno.h
 delete mode 100644 arch/x86/kvm/vmx/tdx_errno.h
 create mode 100644 arch/x86/mm/meminfo.c

--
2.47.2
On 6/9/25 12:13, Kirill A. Shutemov wrote: > The exact size of the required PAMT memory is determined by the TDX > module and may vary between TDX module versions, but currently it is > approximately 0.4% of the system memory. This is a significant > commitment, especially if it is not known upfront whether the machine > will run any TDX guests. > > The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and > PAMT_2M levels are still allocated on TDX module initialization, but the > PAMT_4K level is allocated dynamically, reducing static allocations to > approximately 0.004% of the system memory. I'm beginning to think that this series is barking up the wrong tree. > 18 files changed, 702 insertions(+), 108 deletions(-) I totally agree that saving 0.4% of memory on a gagillion systems saves a gagillion dollars. But this series seems to be marching down the path that the savings needs to be down at page granularity: If a 2M page doesn't need PAMT then it strictly shouldn't have any PAMT. While that's certainly a high-utiltiy tact, I can't help but think it may be over complicated. What if we just focused on three states: 1. System boots, has no DPAMT. 2. First TD starts up, all DPAMT gets allocated 3. Last TD shuts down, all DPAMT gets freed The cases that leaves behind are when the system has a small number of TDs packed into a relatively small number of 2M pages. That occurs either because they're backing with real huge pages or that they are backed with 4k and nicely compacted because memory wasn't fragmented. I know our uberscaler buddies are quite fond of those cases and want to minimize memory use. But are you folks really going to have that many systems which deploy a very small number of small TDs? In other words, can we simplify this? Or at least _start_ simpler with v1?
On Mon, Sep 08, 2025 at 12:17:58PM -0700, Dave Hansen wrote: > On 6/9/25 12:13, Kirill A. Shutemov wrote: > > The exact size of the required PAMT memory is determined by the TDX > > module and may vary between TDX module versions, but currently it is > > approximately 0.4% of the system memory. This is a significant > > commitment, especially if it is not known upfront whether the machine > > will run any TDX guests. > > > > The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and > > PAMT_2M levels are still allocated on TDX module initialization, but the > > PAMT_4K level is allocated dynamically, reducing static allocations to > > approximately 0.004% of the system memory. > > I'm beginning to think that this series is barking up the wrong tree. > > > 18 files changed, 702 insertions(+), 108 deletions(-) > > I totally agree that saving 0.4% of memory on a gagillion systems saves > a gagillion dollars. > > But this series seems to be marching down the path that the savings > needs to be down at page granularity: If a 2M page doesn't need PAMT > then it strictly shouldn't have any PAMT. While that's certainly a > high-utiltiy tact, I can't help but think it may be over complicated. > > What if we just focused on three states: > > 1. System boots, has no DPAMT. > 2. First TD starts up, all DPAMT gets allocated > 3. Last TD shuts down, all DPAMT gets freed > > The cases that leaves behind are when the system has a small number of > TDs packed into a relatively small number of 2M pages. That occurs > either because they're backing with real huge pages or that they are > backed with 4k and nicely compacted because memory wasn't fragmented. > > I know our uberscaler buddies are quite fond of those cases and want to > minimize memory use. But are you folks really going to have that many > systems which deploy a very small number of small TDs? > > In other words, can we simplify this? Or at least _start_ simpler with v1? You cannot give all DPAMT memory to TDX module at the start of the first TD and forget about it. Well you can, if you give up ever mapping guest memory with 2M pages. Dynamic PAMT pages are stored into PAMT_2M entry and you cannot have 2M page and have Dynamic 4K entries stored there at the same time. You would need to handle at least for promotion/demotion and stash this memory somewhere while 2M pages used. And it is going to be very wasteful. With huge pages, in most cases, you only need dynamic PAMT for control pages. You will have a lot of memory sitting in stash with zero use. -- Kiryl Shutsemau / Kirill A. Shutemov
On 9/9/25 04:16, Kiryl Shutsemau wrote: > Dynamic PAMT pages are stored into PAMT_2M entry and you cannot have 2M > page and have Dynamic 4K entries stored there at the same time. That sounds like a TDX module implementation bug to me. Worst possible case, the TDX module could double the 'sysinfo_tdmr->pamt_2m_entry_size' and use the other half for more pointers. > And it is going to be very wasteful. With huge pages, in most cases, you > only need dynamic PAMT for control pages. You will have a lot of memory > sitting in stash with zero use. I think it's going to be hard to convince me without actual data on this one. Even then, we're talking about 0.4% of system memory. So how much code and complexity are we talking about in order to save a *maximum* of 0.4% of system memory?
On Tue, Sep 9, 2025 at 8:24 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/9/25 04:16, Kiryl Shutsemau wrote:
>
> > And it is going to be very wasteful. With huge pages, in most cases, you
> > only need dynamic PAMT for control pages. You will have a lot of memory
> > sitting in stash with zero use.
>
> I think it's going to be hard to convince me without actual data on this
> one.
>

* With 1G page backing and with DPAMT entries created only for 4K EPT
  mappings - ~5MB of DPAMT memory usage for 704G guest memory size. We expect
  the DPAMT memory usage to be in MBs even with 4096G guest memory size.

* With DPAMT entries created for all private memory irrespective of mapping
  granularity - DPAMT memory usage is around ~3GB for 704G guest memory size
  and around ~16G for 4096G guest memory size.

For a 4TB guest memory size with 1G page backing, the DPAMT memory usage

> Even then, we're talking about 0.4% of system memory. So how much code
> and complexity are we talking about in order to save a *maximum* of 0.4%
> of system memory?
>
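For scale, a back-of-the-envelope check (assuming PAMT_4K entries end up
covering every private 4K page, i.e. the ~0.4% figure from the cover letter):

  0.4% of  704 GiB ~= 2.8 GiB   -> consistent with the "~3GB for 704G" figure above
  0.4% of 4096 GiB ~= 16.4 GiB  -> consistent with the "~16G for 4096G" figure above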
On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote: > Dynamic PAMT enabling in kernel > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Kernel maintains refcounts for every 2M regions with two helpers > tdx_pamt_get() and tdx_pamt_put(). > > The refcount represents number of users for the PAMT memory in the region. > Kernel calls TDH.PHYMEM.PAMT.ADD on 0->1 transition and > TDH.PHYMEM.PAMT.REMOVE on transition 1->0. > > The function tdx_alloc_page() allocates a new page and ensures that it is > backed by PAMT memory. Pages allocated in this manner are ready to be used > for a TD. The function tdx_free_page() frees the page and releases the > PAMT memory for the 2M region if it is no longer needed. > > PAMT memory gets allocated as part of TD init, VCPU init, on populating > SEPT tree and adding guest memory (both during TD build and via AUG on > accept). Splitting 2M page into 4K also requires PAMT memory. > > PAMT memory removed on reclaim of control pages and guest memory. > > Populating PAMT memory on fault and on split is tricky as kernel cannot > allocate memory from the context where it is needed. These code paths use > pre-allocated PAMT memory pools. > > Previous attempt on Dynamic PAMT enabling > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > The initial attempt at kernel enabling was quite different. It was built > around lazy PAMT allocation: only trying to add a PAMT page pair if a > SEAMCALL fails due to a missing PAMT and reclaiming it based on hints > provided by the TDX module. > > The motivation was to avoid duplicating the PAMT memory refcounting that > the TDX module does on the kernel side. > > This approach is inherently more racy as there is no serialization of > PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD. > Such serialization would require global locking, which is not feasible. > > This approach worked, but at some point it became clear that it could not > be robust as long as the kernel avoids TDX_OPERAND_BUSY loops. > TDX_OPERAND_BUSY will occur as a result of the races mentioned above. > > This approach was abandoned in favor of explicit refcounting. On closer inspection this new solution also has global locking. It opportunistically checks to see if there is already a refcount, but otherwise when a PAMT page actually has to be added there is a global spinlock while the PAMT add/remove SEAMCALL is made. I guess this is going to get taken somewhere around once per 512 4k private pages, but when it does it has some less ideal properties: - Cache line bouncing of the lock between all TDs on the host - An global exclusive lock deep inside the TDP MMU shared lock fault path - Contend heavily when two TD's shutting down at the same time? As for why not only do the lock as a backup option like the kick+lock solution in KVM, the problem would be losing the refcount race and ending up with a PAMT page getting released early. As far as TDX module locking is concerned (i.e. BUSY error codes from pamt add/remove), it seems this would only happen if pamt add/remove operate simultaneously on the same 2MB HPA region. That is completely prevented by the refcount and global lock, but it's a bit heavyweight. It prevents simultaneously adding totally separate 2MB regions when we only would need to prevent simultaneously operating on the same 2MB region. I don't see any other reason for the global spin lock, Kirill was that it? Did you consider also adding a lock per 2MB region, like the refcount? Or any other granularity of lock besides global? 
Not saying global is definitely the wrong choice, but seems arbitrary if I got the above right.
On Fri, Aug 08, 2025 at 11:18:40PM +0000, Edgecombe, Rick P wrote: > On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote: > > Dynamic PAMT enabling in kernel > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > Kernel maintains refcounts for every 2M regions with two helpers > > tdx_pamt_get() and tdx_pamt_put(). > > > > The refcount represents number of users for the PAMT memory in the region. > > Kernel calls TDH.PHYMEM.PAMT.ADD on 0->1 transition and > > TDH.PHYMEM.PAMT.REMOVE on transition 1->0. > > > > The function tdx_alloc_page() allocates a new page and ensures that it is > > backed by PAMT memory. Pages allocated in this manner are ready to be used > > for a TD. The function tdx_free_page() frees the page and releases the > > PAMT memory for the 2M region if it is no longer needed. > > > > PAMT memory gets allocated as part of TD init, VCPU init, on populating > > SEPT tree and adding guest memory (both during TD build and via AUG on > > accept). Splitting 2M page into 4K also requires PAMT memory. > > > > PAMT memory removed on reclaim of control pages and guest memory. > > > > Populating PAMT memory on fault and on split is tricky as kernel cannot > > allocate memory from the context where it is needed. These code paths use > > pre-allocated PAMT memory pools. > > > > Previous attempt on Dynamic PAMT enabling > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > The initial attempt at kernel enabling was quite different. It was built > > around lazy PAMT allocation: only trying to add a PAMT page pair if a > > SEAMCALL fails due to a missing PAMT and reclaiming it based on hints > > provided by the TDX module. > > > > The motivation was to avoid duplicating the PAMT memory refcounting that > > the TDX module does on the kernel side. > > > > This approach is inherently more racy as there is no serialization of > > PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD. > > Such serialization would require global locking, which is not feasible. > > > > This approach worked, but at some point it became clear that it could not > > be robust as long as the kernel avoids TDX_OPERAND_BUSY loops. > > TDX_OPERAND_BUSY will occur as a result of the races mentioned above. > > > > This approach was abandoned in favor of explicit refcounting. > > On closer inspection this new solution also has global locking. It > opportunistically checks to see if there is already a refcount, but otherwise > when a PAMT page actually has to be added there is a global spinlock while the > PAMT add/remove SEAMCALL is made. I guess this is going to get taken somewhere > around once per 512 4k private pages, but when it does it has some less ideal > properties: > - Cache line bouncing of the lock between all TDs on the host > - An global exclusive lock deep inside the TDP MMU shared lock fault path > - Contend heavily when two TD's shutting down at the same time? > > As for why not only do the lock as a backup option like the kick+lock solution > in KVM, the problem would be losing the refcount race and ending up with a PAMT > page getting released early. > > As far as TDX module locking is concerned (i.e. BUSY error codes from pamt > add/remove), it seems this would only happen if pamt add/remove operate > simultaneously on the same 2MB HPA region. That is completely prevented by the > refcount and global lock, but it's a bit heavyweight. It prevents simultaneously > adding totally separate 2MB regions when we only would need to prevent > simultaneously operating on the same 2MB region. 
> > I don't see any other reason for the global spin lock, Kirill was that it? Did > you consider also adding a lock per 2MB region, like the refcount? Or any other > granularity of lock besides global? Not saying global is definitely the wrong > choice, but seems arbitrary if I got the above right. We have discussed this before[1]. Global locking is problematic when you actually hit contention. Let's not complicate things until we actually see it. I failed to demonstrate contention without huge pages. With huge pages it is even more dubious that we ever see it. [1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/ -- Kiryl Shutsemau / Kirill A. Shutemov
On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote: > > I don't see any other reason for the global spin lock, Kirill was that it? > > Did > > you consider also adding a lock per 2MB region, like the refcount? Or any > > other > > granularity of lock besides global? Not saying global is definitely the > > wrong > > choice, but seems arbitrary if I got the above right. > > We have discussed this before[1]. Global locking is problematic when you > actually hit contention. Let's not complicate things until we actually > see it. I failed to demonstrate contention without huge pages. With huge > pages it is even more dubious that we ever see it. > > [1] > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/ Ah, I see. I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge pages) and then shutting them down. I saw 701 contentions on startup, and 53 more on shutdown. Total wait time 2ms. Not horrible but not theoretical either. But it probably wasn't much of a cacheline bouncing worse case. And I guess this is on my latest changes not this exact v2, but it shouldn't have changed. But hmm, it seems Dave's objection about maintaining the lock allocations would apply to the refcounts too? But the hotplug concerns shouldn't actually be an issue for TDX because they gets rejected if the allocations are not already there. So complexity of a per-2MB lock should be minimal, at least incrementally. The difference seems more about memory use vs performance. What gives me pause is in the KVM TDX work we have really tried hard to not take exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by hard numbers. But an enormous amount of work went into lettings KVM faults happen under the shared lock for normal VMs. So on one hand, yes it's premature optimization. But on the other hand, it's a maintainability concern about polluting the existing way things work in KVM with special TDX properties. I think we need to at least call out loudly that the decision was to go with the simplest possible solution, and the impact to KVM. I'm not sure what Sean's opinion is, but I wouldn't want him to first learn of it when he went digging and found a buried global spin lock in the fault path.
On Mon, Aug 11, 2025, Rick P Edgecombe wrote: > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote: > > > I don't see any other reason for the global spin lock, Kirill was that > > > it? Did you consider also adding a lock per 2MB region, like the > > > refcount? Or any other granularity of lock besides global? Not saying > > > global is definitely the wrong choice, but seems arbitrary if I got the > > > above right. > > > > We have discussed this before[1]. Global locking is problematic when you > > actually hit contention. Let's not complicate things until we actually > > see it. I failed to demonstrate contention without huge pages. With huge > > pages it is even more dubious that we ever see it. > > > > [1] > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/ > > Ah, I see. > > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge How many vCPUs? And were the VMs actually accepting/faulting all 16GiB? There's also a noisy neighbor problem lurking. E.g. malicious/buggy VM spams private<=>shared conversions and thus interferes with PAMT allocations for other VMs. > pages) and then shutting them down. I saw 701 contentions on startup, and 53 > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either. > But it probably wasn't much of a cacheline bouncing worse case. Isn't the SEAMCALL done while holding the spinlock? I assume the latency of the SEAMCALL is easily the long pole in the flow. > And I guess this is on my latest changes not this exact v2, but it shouldn't > have changed. > > But hmm, it seems Dave's objection about maintaining the lock allocations would > apply to the refcounts too? But the hotplug concerns shouldn't actually be an > issue for TDX because they gets rejected if the allocations are not already > there. So complexity of a per-2MB lock should be minimal, at least > incrementally. The difference seems more about memory use vs performance. > > What gives me pause is in the KVM TDX work we have really tried hard to not take > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by > hard numbers. Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock for write is problematic. Even if TDX VMs don't exhibit the same patterns *today* as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees that will always hold true. > But an enormous amount of work went into lettings KVM faults happen under the > shared lock for normal VMs. So on one hand, yes it's premature optimization. > But on the other hand, it's a maintainability concern about polluting the > existing way things work in KVM with special TDX properties. > > I think we need to at least call out loudly that the decision was to go with the > simplest possible solution, and the impact to KVM. I'm not sure what Sean's > opinion is, but I wouldn't want him to first learn of it when he went digging > and found a buried global spin lock in the fault path. Heh, too late, I saw it when this was first posted. And to be honest, my initial reaction was very much "absolutely not" (though Rated R, not PG). Now that I've had time to think things through, I'm not _totally_ opposed to having a spinlock in the page fault path, but my overall sentiment remains the same. 
For mmu_lock and related SPTE operations, I was super adamant about not taking exclusive locks because based on our experience with the TDP MMU, converting flows from exclusive to shared is usually significantly more work than developing code for "shared mode" straightaway (and you note above, that wasn't trivial for TDX). And importantly, those code paths were largely solved problems. I.e. I didn't want to get into a situation where TDX undid the parallelization of the TDP MMU, and then had to add it back after the fact. I think the same holds true here. I'm not completely opposed to introducing a spinlock, but I want to either have a very high level of confidence that the lock won't introduce jitter/delay (I have low confidence on this front, at least in the proposed patches), or have super clear line of sight to making the contention irrelevant, without having to rip apart the code. My biggest question at this point is: why is all of this being done on-demand? IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K pages at the last possible moment". Neither of those seems ideal. E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand PAMT management seems reasonable. But for PAMTs that are used to track guest-assigned memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd? I.e. setup PAMT crud when guest_memfd is populated, not when the memory is mapped into the guest. That way setups that cares about guest boot time can preallocate guest_memfd in order to get the PAMT stuff out of the way. You could do the same thing by prefaulting guest memory, but TDX has limitations there, and I see very little value in precisely reclaiming PAMT memory when a leaf S-EPT is zapped, i.e. when a page is converted from private=>shared. As above, that's just asking for noisy neighbor issues. The complaints with static PAMT are that it required burning 0.4% of memory even if the host isn't actively running TDX VMs. Burning 0.4% of the memory assigned to a guest, regardless of whether it's map private or shared, seems acceptable, and I think would give us a lot more flexibility in avoiding locking issues. Similarly, we could bind a PAMT to non-leaf S-EPT pages during mmu_topup_memory_caches(), i.e. when arch.mmu_external_spt_cache is filled. Then there would be no need for a separate vcpu->arch.pamt_page_cache, and more work would be done outside of mmu_lock. Freeing SPTs would still be done under mmu_lock (I think), but that should be a much rarer operation.
On Mon, 2025-08-11 at 19:02 -0700, Sean Christopherson wrote: > > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non > > huge > > How many vCPUs? And were the VMs actually accepting/faulting all 16GiB? 4 vCPUs. I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory from 4 threads until OOM, then shutdown. TDs were split between two sockets. It ended up with 1136 contentions of the global lock, 4ms waiting. It still feels very wrong, but also not something that is a very measurable in the real world. Like Kirill was saying, the global lock is not held very long. I'm not sure if this may still hit scalability issues from cacheline bouncing on bigger multisocket systems. But we do have a path forwards if we hit this. Depending on the scale of the problem that comes up we could decide whether to do the lock per-2MB region with more memory usage, or a hashed table of N locks like Dave suggested. So I'll plan to keep the existing single lock for now unless anyone has any strong objections. > > There's also a noisy neighbor problem lurking. E.g. malicious/buggy VM spams > private<=>shared conversions and thus interferes with PAMT allocations for > other > VMs. Hmm, as long as it doesn't block it completely, it seems ok?
On 8/13/25 15:43, Edgecombe, Rick P wrote: > I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory > from 4 threads until OOM, then shutdown. TDs were split between two sockets. It > ended up with 1136 contentions of the global lock, 4ms waiting. 4ms out of how much CPU time? Also, contention is *NOT* necessarily bad here. Only _false_ contention. The whole point of the lock is to ensure that there aren't two different CPUs trying to do two different things to the same PAMT range at the same time. If there are, one of them *HAS* to wait. It can wait lots of different ways, but it has to wait. That wait will show up as spinlock contention. Even if the global lock went away, that 4ms of spinning might still be there.
On Wed, 2025-08-13 at 16:31 -0700, Dave Hansen wrote:
> On 8/13/25 15:43, Edgecombe, Rick P wrote:
> > I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory
> > from 4 threads until OOM, then shutdown. TDs were split between two sockets. It
> > ended up with 1136 contentions of the global lock, 4ms waiting.
>
> 4ms out of how much CPU time?

The whole test took about 60s wall time (minus the time of some manual steps).
I'll have to automate it a bit more. But 4ms seemed safely in the "small"
category.

> Also, contention is *NOT* necessarily bad here. Only _false_ contention.
>
> The whole point of the lock is to ensure that there aren't two different
> CPUs trying to do two different things to the same PAMT range at the
> same time.
>
> If there are, one of them *HAS* to wait. It can wait lots of different
> ways, but it has to wait. That wait will show up as spinlock contention.
>
> Even if the global lock went away, that 4ms of spinning might still be
> there.

I assumed it was mostly real contention because the refcount check outside the
lock should prevent the majority of "two threads operating on the same 2MB
region" collisions. The code is roughly:

1:
	if (atomic_inc_not_zero(2mb_pamt_refcount))
		return <it's mapped>;
2:
	<global lock>
	if (atomic_read(2mb_pamt_refcount) != 0) {
3:
		atomic_inc(2mb_pamt_refcount);
		<global unlock>
		return <it's mapped>;
	}
	<seamcall>
	<global unlock>
4:

(similar pattern on the unmapping)

So it will only be valid contention if two threads try to fault in the *same*
2MB DPAMT region *and* lose that race around 1-3, but invalid contention if
threads try to execute 2-4 at the same time for any different 2MB regions.

Let me go verify.
On Thu, Aug 14, 2025 at 12:14:40AM +0000, Edgecombe, Rick P wrote: > On Wed, 2025-08-13 at 16:31 -0700, Dave Hansen wrote: > > On 8/13/25 15:43, Edgecombe, Rick P wrote: > > > I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory > > > from 4 threads until OOM, then shutdown. TDs were split between two sockets. It > > > ended up with 1136 contentions of the global lock, 4ms waiting. > > > > 4ms out of how much CPU time? > > The whole test took about 60s wall time (minus the time of some manual steps). > I'll have to automate it a bit more. But 4ms seemed safely in the "small" > category. > > > > > Also, contention is *NOT* necessarily bad here. Only _false_ contention. > > > > The whole point of the lock is to ensure that there aren't two different > > CPUs trying to do two different things to the same PAMT range at the > > same time. > > > > If there are, one of them *HAS* to wait. It can wait lots of different > > ways, but it has to wait. That wait will show up as spinlock contention. > > > > Even if the global lock went away, that 4ms of spinning might still be > > there. > > I assumed it was mostly real contention because the the refcount check outside > the lock should prevent the majority of "two threads operating on the same 2MB > region" collisions. The code is roughly: > > 1: > if (atomic_inc_not_zero(2mb_pamt_refcount)) > return <it's mapped>; > 2: > <global lock> > if (atomic_read(2mb_pamt_refcount) != 0) { > 3: > atomic_inc(2mb_pamt_refcount); > <global unlock> > return <it's mapped>; > } > <seamcall> > <global unlock> > 4: > > (similar pattern on the unmapping) > > So it will only be valid contention if two threads try to fault in the *same* 2MB > DPAMT region *and* lose that race around 1-3, but invalid contention if threads try > to execute 2-4 at the same time for any different 2MB regions. > > Let me go verify. Note that in absence of the global lock here, concurrent PAMT.ADD would also trigger some cache bouncing during pamt_walk() on taking shared lock on 1G PAMT entry and exclusive lock on 2M entries in the same cache (4 PAMT_2M entries per cache line). This is hidden by the global lock. You would not recover full contention time by removing the global lock. -- Kiryl Shutsemau / Kirill A. Shutemov
On Thu, 2025-08-14 at 11:55 +0100, Kiryl Shutsemau wrote:
> > (similar pattern on the unmapping)
> >
> > So it will only be valid contention if two threads try to fault in the
> > *same* 2MB DPAMT region *and* lose that race around 1-3, but invalid
> > contention if threads try to execute 2-4 at the same time for any
> > different 2MB regions.
> >
> > Let me go verify.

It lost the race only once over a couple runs. So it seems mostly invalid
contention.

> Note that in absence of the global lock here, concurrent PAMT.ADD would
> also trigger some cache bouncing during pamt_walk() on taking shared
> lock on 1G PAMT entry and exclusive lock on 2M entries in the same
> cache (4 PAMT_2M entries per cache line). This is hidden by the global
> lock.
>
> You would not recover full contention time by removing the global lock.

Hmm, yea. Another consideration is that performance sensitive users will
probably be using huge pages, in which case 4k PAMT will be mostly skipped.

But man, the number and complexity of the locks is getting a bit high across
the whole stack. I don't have any easy ideas.
On Fri, Aug 15, 2025, Rick P Edgecombe wrote: > On Thu, 2025-08-14 at 11:55 +0100, Kiryl Shutsemau wrote: > > > > > (similar pattern on the unmapping) > > > > > > > > > > So it will only be valid contention if two threads try to fault in the > > > > > > > *same* 2MB > > > > > DPAMT region *and* lose that race around 1-3, but invalid contention if > > > > > > > threads try > > > > > to execute 2-4 at the same time for any different 2MB regions. > > > > > > > > > > Let me go verify. > > It lost the race only once over a couple runs. So it seems mostly invalid > contention. > > > > > > > Note that in absence of the global lock here, concurrent PAMT.ADD would > > > also trigger some cache bouncing during pamt_walk() on taking shared > > > lock on 1G PAMT entry and exclusive lock on 2M entries in the same > > > cache (4 PAMT_2M entries per cache line). This is hidden by the global > > > lock. > > > > > > You would not recover full contention time by removing the global lock. > > Hmm, yea. Another consideration is that performance sensitive users will > probably be using huge pages, in which case 4k PAMT will be mostly skipped. > > But man, the number and complexity of the locks is getting a bit high across the > whole stack. I don't have any easy ideas. FWIW, I'm not concerned about bouncing cachelines, I'm concerned about the cost of the SEAMCALLs. The latency due to bouncing a cache line due to "false" contention is probably in the noise compared to waiting thousands of cycles for other SEAMCALLs to complete. That's also my concern with tying PAMT management to S-EPT population. E.g. if a use case triggers a decent amount S-EPT churn, then dynamic PAMT support will exacerbate the S-EPT overhead. But IIUC, that's a limitation of the TDX-Module design, i.e. there's no way to hand it a pool of PAMT pages to manage. And I suppose if a use case is churning S-EPT, then it's probably going to be sad no matter what. So, as long as the KVM side of things isn't completely awful, I can live with on-demand PAMT management. As for the global lock, I don't really care what we go with for initial support, just so long as there's clear line of sight to an elegant solution _if_ we need shard the lock.
On Wed, 2025-08-20 at 08:31 -0700, Sean Christopherson wrote: > > But man, the number and complexity of the locks is getting a bit high across > > the whole stack. I don't have any easy ideas. > > FWIW, I'm not concerned about bouncing cachelines, I'm concerned about the > cost of the SEAMCALLs. The latency due to bouncing a cache line due to > "false" contention is probably in the noise compared to waiting thousands of > cycles for other SEAMCALLs to complete. > > That's also my concern with tying PAMT management to S-EPT population. E.g. > if a use case triggers a decent amount S-EPT churn, then dynamic PAMT support > will exacerbate the S-EPT overhead. I confirmed matching the page size is currently required. Having it work with mismatched page sizes was considered, but assessed to require more memory use. As in more pages needed per 2MB region, not just more memory usage due to the pre-allocation of all memory. We can do it if we prefer the simplicity over memory usage. > > But IIUC, that's a limitation of the TDX-Module design, i.e. there's no way to > hand it a pool of PAMT pages to manage. And I suppose if a use case is > churning S-EPT, then it's probably going to be sad no matter what. So, as > long as the KVM side of things isn't completely awful, I can live with on- > demand PAMT management. > > As for the global lock, I don't really care what we go with for initial > support, just so long as there's clear line of sight to an elegant solution > _if_ we need shard the lock. Ok, I'll leave it and we can look at whether the KVM side is simple enough. Thanks for circling back.
On Mon, Aug 11, 2025 at 07:02:10PM -0700, Sean Christopherson wrote: > On Mon, Aug 11, 2025, Rick P Edgecombe wrote: > > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote: > > > > I don't see any other reason for the global spin lock, Kirill was that > > > > it? Did you consider also adding a lock per 2MB region, like the > > > > refcount? Or any other granularity of lock besides global? Not saying > > > > global is definitely the wrong choice, but seems arbitrary if I got the > > > > above right. > > > > > > We have discussed this before[1]. Global locking is problematic when you > > > actually hit contention. Let's not complicate things until we actually > > > see it. I failed to demonstrate contention without huge pages. With huge > > > pages it is even more dubious that we ever see it. > > > > > > [1] > > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/ > > > > Ah, I see. > > > > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge > > How many vCPUs? And were the VMs actually accepting/faulting all 16GiB? > > There's also a noisy neighbor problem lurking. E.g. malicious/buggy VM spams > private<=>shared conversions and thus interferes with PAMT allocations for other > VMs. > > > pages) and then shutting them down. I saw 701 contentions on startup, and 53 > > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either. > > But it probably wasn't much of a cacheline bouncing worse case. > > Isn't the SEAMCALL done while holding the spinlock? I assume the latency of the > SEAMCALL is easily the long pole in the flow. > > > And I guess this is on my latest changes not this exact v2, but it shouldn't > > have changed. > > > > But hmm, it seems Dave's objection about maintaining the lock allocations would > > apply to the refcounts too? But the hotplug concerns shouldn't actually be an > > issue for TDX because they gets rejected if the allocations are not already > > there. So complexity of a per-2MB lock should be minimal, at least > > incrementally. The difference seems more about memory use vs performance. I don't see jump to per-2MB locking remotely justified. We can scale number of locks gradually with the amount of memory in the system: have a power-of-2 set of locks and 2MB range to the lock with %. Note that it is trivial thing to add later on and doesn't need to be part of initial design. > > What gives me pause is in the KVM TDX work we have really tried hard to not take > > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by > > hard numbers. > > Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock > for write is problematic. Even if TDX VMs don't exhibit the same patterns *today* > as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees > that will always hold true. > > > But an enormous amount of work went into lettings KVM faults happen under the > > shared lock for normal VMs. So on one hand, yes it's premature optimization. > > But on the other hand, it's a maintainability concern about polluting the > > existing way things work in KVM with special TDX properties. > > > > I think we need to at least call out loudly that the decision was to go with the > > simplest possible solution, and the impact to KVM. I'm not sure what Sean's > > opinion is, but I wouldn't want him to first learn of it when he went digging > > and found a buried global spin lock in the fault path. 
>
> Heh, too late, I saw it when this was first posted. And to be honest, my initial
> reaction was very much "absolutely not" (though Rated R, not PG). Now that I've
> had time to think things through, I'm not _totally_ opposed to having a spinlock
> in the page fault path, but my overall sentiment remains the same.
>
> For mmu_lock and related SPTE operations, I was super adamant about not taking
> exclusive locks because based on our experience with the TDP MMU, converting flows
> from exclusive to shared is usually significantly more work than developing code
> for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> And importantly, those code paths were largely solved problems. I.e. I didn't
> want to get into a situation where TDX undid the parallelization of the TDP MMU,
> and then had to add it back after the fact.
>
> I think the same holds true here. I'm not completely opposed to introducing a
> spinlock, but I want to either have a very high level of confidence that the lock
> won't introduce jitter/delay (I have low confidence on this front, at least in
> the proposed patches), or have super clear line of sight to making the contention
> irrelevant, without having to rip apart the code.

I think there is a big difference with mmu_lock. mmu_lock is analogous to
mmap_lock in core-mm. It serializes page faults against other MMU operations
and has an inherently vast scope. pamt_lock, on the other hand, is at the very
bottom of the call chain and has a very limited scope. It is trivially
scalable by partitioning.

Translating the problems you see with mmu_lock onto pamt_lock seems like an
overreaction.

--
  Kiryl Shutsemau / Kirill A. Shutemov
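For illustration, the lock-sharding fallback mentioned above (a power-of-two
set of locks, with each 2MB range hashed onto one of them) could look roughly
like the sketch below. The names and the bucket count are made up for this
example, not taken from the series:

  #include <linux/init.h>
  #include <linux/sizes.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  #define NR_PAMT_LOCKS	64	/* power of two; could scale with memory size */

  static spinlock_t pamt_locks[NR_PAMT_LOCKS];

  static void __init pamt_locks_init(void)
  {
  	int i;

  	for (i = 0; i < NR_PAMT_LOCKS; i++)
  		spin_lock_init(&pamt_locks[i]);
  }

  /* Map the 2M region containing @hpa onto one of the lock shards. */
  static spinlock_t *pamt_lock_for(phys_addr_t hpa)
  {
  	return &pamt_locks[(hpa / SZ_2M) % NR_PAMT_LOCKS];
  }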
On Mon, Aug 11, 2025 at 7:02 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Aug 11, 2025, Rick P Edgecombe wrote: > > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote: > > > > I don't see any other reason for the global spin lock, Kirill was that > > > > it? Did you consider also adding a lock per 2MB region, like the > > > > refcount? Or any other granularity of lock besides global? Not saying > > > > global is definitely the wrong choice, but seems arbitrary if I got the > > > > above right. > > > > > > We have discussed this before[1]. Global locking is problematic when you > > > actually hit contention. Let's not complicate things until we actually > > > see it. I failed to demonstrate contention without huge pages. With huge > > > pages it is even more dubious that we ever see it. > > > > > > [1] > > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/ > > > > Ah, I see. > > > > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge > > How many vCPUs? And were the VMs actually accepting/faulting all 16GiB? > > There's also a noisy neighbor problem lurking. E.g. malicious/buggy VM spams > private<=>shared conversions and thus interferes with PAMT allocations for other > VMs. > > > pages) and then shutting them down. I saw 701 contentions on startup, and 53 > > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either. > > But it probably wasn't much of a cacheline bouncing worse case. > > Isn't the SEAMCALL done while holding the spinlock? I assume the latency of the > SEAMCALL is easily the long pole in the flow. > > > And I guess this is on my latest changes not this exact v2, but it shouldn't > > have changed. > > > > But hmm, it seems Dave's objection about maintaining the lock allocations would > > apply to the refcounts too? But the hotplug concerns shouldn't actually be an > > issue for TDX because they gets rejected if the allocations are not already > > there. So complexity of a per-2MB lock should be minimal, at least > > incrementally. The difference seems more about memory use vs performance. > > > > What gives me pause is in the KVM TDX work we have really tried hard to not take > > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by > > hard numbers. > > Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock > for write is problematic. Even if TDX VMs don't exhibit the same patterns *today* > as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees > that will always hold true. > > > But an enormous amount of work went into lettings KVM faults happen under the > > shared lock for normal VMs. So on one hand, yes it's premature optimization. > > But on the other hand, it's a maintainability concern about polluting the > > existing way things work in KVM with special TDX properties. > > > > I think we need to at least call out loudly that the decision was to go with the > > simplest possible solution, and the impact to KVM. I'm not sure what Sean's > > opinion is, but I wouldn't want him to first learn of it when he went digging > > and found a buried global spin lock in the fault path. > > Heh, too late, I saw it when this was first posted. And to be honest, my initial > reaction was very much "absolutely not" (though Rated R, not PG). Now that I've > had time to think things through, I'm not _totally_ opposed to having a spinlock > in the page fault path, but my overall sentiment remains the same. 
> > For mmu_lock and related SPTE operations, I was super adamant about not taking > exclusive locks because based on our experience with the TDP MMU, converting flows > from exclusive to shared is usually significantly more work than developing code > for "shared mode" straightaway (and you note above, that wasn't trivial for TDX). > And importantly, those code paths were largely solved problems. I.e. I didn't > want to get into a situation where TDX undid the parallelization of the TDP MMU, > and then had to add it back after the fact. > > I think the same holds true here. I'm not completely opposed to introducing a > spinlock, but I want to either have a very high level of confidence that the lock > won't introduce jitter/delay (I have low confidence on this front, at least in > the proposed patches), or have super clear line of sight to making the contention > irrelevant, without having to rip apart the code. > > My biggest question at this point is: why is all of this being done on-demand? > IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K > pages at the last possible moment". Neither of those seems ideal. > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand > PAMT management seems reasonable. But for PAMTs that are used to track guest-assigned > memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd? This seems fine for 4K page backing. But when TDX VMs have huge page backing, the vast majority of private memory memory wouldn't need PAMT allocation for 4K granularity. IIUC guest_memfd allocation happening at 2M granularity doesn't necessarily translate to 2M mapping in guest EPT entries. If the DPAMT support is to be properly utilized for huge page backings, there is a value in not attaching PAMT allocation with guest_memfd allocation. > I.e. setup PAMT crud when guest_memfd is populated, not when the memory is mapped > into the guest. That way setups that cares about guest boot time can preallocate > guest_memfd in order to get the PAMT stuff out of the way. > > You could do the same thing by prefaulting guest memory, but TDX has limitations > there, and I see very little value in precisely reclaiming PAMT memory when a > leaf S-EPT is zapped, i.e. when a page is converted from private=>shared. As > above, that's just asking for noisy neighbor issues. > > The complaints with static PAMT are that it required burning 0.4% of memory even > if the host isn't actively running TDX VMs. Burning 0.4% of the memory assigned > to a guest, regardless of whether it's map private or shared, seems acceptable, > and I think would give us a lot more flexibility in avoiding locking issues. > > Similarly, we could bind a PAMT to non-leaf S-EPT pages during mmu_topup_memory_caches(), > i.e. when arch.mmu_external_spt_cache is filled. Then there would be no need for > a separate vcpu->arch.pamt_page_cache, and more work would be done outside of > mmu_lock. Freeing SPTs would still be done under mmu_lock (I think), but that > should be a much rarer operation. >
On Mon, Aug 11, 2025 at 07:31:26PM -0700, Vishal Annapurve wrote: > On Mon, Aug 11, 2025 at 7:02 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Aug 11, 2025, Rick P Edgecombe wrote: > > > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote: > > > > > I don't see any other reason for the global spin lock, Kirill was that > > > > > it? Did you consider also adding a lock per 2MB region, like the > > > > > refcount? Or any other granularity of lock besides global? Not saying > > > > > global is definitely the wrong choice, but seems arbitrary if I got the > > > > > above right. > > > > > > > > We have discussed this before[1]. Global locking is problematic when you > > > > actually hit contention. Let's not complicate things until we actually > > > > see it. I failed to demonstrate contention without huge pages. With huge > > > > pages it is even more dubious that we ever see it. > > > > > > > > [1] > > > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/ > > > > > > Ah, I see. > > > > > > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge > > > > How many vCPUs? And were the VMs actually accepting/faulting all 16GiB? > > > > There's also a noisy neighbor problem lurking. E.g. malicious/buggy VM spams > > private<=>shared conversions and thus interferes with PAMT allocations for other > > VMs. > > > > > pages) and then shutting them down. I saw 701 contentions on startup, and 53 > > > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either. > > > But it probably wasn't much of a cacheline bouncing worse case. > > > > Isn't the SEAMCALL done while holding the spinlock? I assume the latency of the > > SEAMCALL is easily the long pole in the flow. > > > > > And I guess this is on my latest changes not this exact v2, but it shouldn't > > > have changed. > > > > > > But hmm, it seems Dave's objection about maintaining the lock allocations would > > > apply to the refcounts too? But the hotplug concerns shouldn't actually be an > > > issue for TDX because they gets rejected if the allocations are not already > > > there. So complexity of a per-2MB lock should be minimal, at least > > > incrementally. The difference seems more about memory use vs performance. > > > > > > What gives me pause is in the KVM TDX work we have really tried hard to not take > > > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by > > > hard numbers. > > > > Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock > > for write is problematic. Even if TDX VMs don't exhibit the same patterns *today* > > as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees > > that will always hold true. > > > > > But an enormous amount of work went into lettings KVM faults happen under the > > > shared lock for normal VMs. So on one hand, yes it's premature optimization. > > > But on the other hand, it's a maintainability concern about polluting the > > > existing way things work in KVM with special TDX properties. > > > > > > I think we need to at least call out loudly that the decision was to go with the > > > simplest possible solution, and the impact to KVM. I'm not sure what Sean's > > > opinion is, but I wouldn't want him to first learn of it when he went digging > > > and found a buried global spin lock in the fault path. > > > > Heh, too late, I saw it when this was first posted. 
> > And to be honest, my initial
> > reaction was very much "absolutely not" (though Rated R, not PG). Now that I've
> > had time to think things through, I'm not _totally_ opposed to having a spinlock
> > in the page fault path, but my overall sentiment remains the same.
> >
> > For mmu_lock and related SPTE operations, I was super adamant about not taking
> > exclusive locks because based on our experience with the TDP MMU, converting flows
> > from exclusive to shared is usually significantly more work than developing code
> > for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> > And importantly, those code paths were largely solved problems. I.e. I didn't
> > want to get into a situation where TDX undid the parallelization of the TDP MMU,
> > and then had to add it back after the fact.
> >
> > I think the same holds true here. I'm not completely opposed to introducing a
> > spinlock, but I want to either have a very high level of confidence that the lock
> > won't introduce jitter/delay (I have low confidence on this front, at least in
> > the proposed patches), or have super clear line of sight to making the contention
> > irrelevant, without having to rip apart the code.
> >
> > My biggest question at this point is: why is all of this being done on-demand?
> > IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K
> > pages at the last possible moment". Neither of those seems ideal.
> >
> > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand
> > PAMT management seems reasonable. But for PAMTs that are used to track guest-assigned
> > memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd?
>
> This seems fine for 4K page backing. But when TDX VMs have huge page
> backing, the vast majority of private memory memory wouldn't need PAMT
> allocation for 4K granularity.
>
> IIUC guest_memfd allocation happening at 2M granularity doesn't
> necessarily translate to 2M mapping in guest EPT entries. If the DPAMT
> support is to be properly utilized for huge page backings, there is a
> value in not attaching PAMT allocation with guest_memfd allocation.

Right.

It also requires special handling in many places in core-mm. Like, what
happens if a THP in guest memfd gets split? Who would allocate PAMT for it?
Migration will be more complicated too (when we get there).

--
Kiryl Shutsemau / Kirill A. Shutemov
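The per-2M-range alternative to the global lock that is being debated above can be pictured with a small sketch. This is an editor's illustration only, with hypothetical names; the posted series uses a single global spinlock, and nothing below is taken from it. The idea is that each 2M region (i.e. each PAMT page pair) hashes to its own lock, so faults in unrelated regions never contend:

#include <linux/hash.h>
#include <linux/init.h>
#include <linux/pgtable.h>
#include <linux/spinlock.h>

#define PAMT_LOCK_BITS	10	/* 1024 locks; sized against expected contention */

static spinlock_t pamt_locks[1 << PAMT_LOCK_BITS];

static void __init pamt_locks_init(void)
{
	int i;

	for (i = 0; i < (1 << PAMT_LOCK_BITS); i++)
		spin_lock_init(&pamt_locks[i]);
}

/* All 4K pages in a 2M region share one PAMT page pair, and hence one lock. */
static spinlock_t *pamt_lock_2m(unsigned long hpa)
{
	return &pamt_locks[hash_long(hpa >> PMD_SHIFT, PAMT_LOCK_BITS)];
}

A tdx_pamt_get()-style helper would then hold pamt_lock_2m(hpa) around the refcount check and the TDH.PHYMEM.PAMT.ADD call for that region only, which is the "per-2MB lock" granularity asked about above, at the cost of maintaining the lock array that Dave's earlier objection concerned.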
On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote: > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, > > > on-demand > > > PAMT management seems reasonable. But for PAMTs that are used to track > > > guest-assigned > > > memory, which is the vaaast majority of PAMT memory, why not hook > > > guest_memfd? > > > > This seems fine for 4K page backing. But when TDX VMs have huge page > > backing, the vast majority of private memory memory wouldn't need PAMT > > allocation for 4K granularity. > > > > IIUC guest_memfd allocation happening at 2M granularity doesn't > > necessarily translate to 2M mapping in guest EPT entries. If the DPAMT > > support is to be properly utilized for huge page backings, there is a > > value in not attaching PAMT allocation with guest_memfd allocation. > > Right. > > It also requires special handling in many places in core-mm. Like, what > happens if THP in guest memfd got split. Who would allocate PAMT for it? > Migration will be more complicated too (when we get there). I actually went down this path too, but the problem I hit was that TDX module wants the PAMT page size to match the S-EPT page size. And the S-EPT size will depend on runtime behavior of the guest. I'm not sure why TDX module requires this though. Kirill, I'd be curious to understand the constraint more if you recall. But in any case, it seems there are multiple reasons.
On Tue, Aug 12, 2025 at 03:12:52PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages,
> > > > on-demand
> > > > PAMT management seems reasonable. But for PAMTs that are used to track
> > > > guest-assigned
> > > > memory, which is the vaaast majority of PAMT memory, why not hook
> > > > guest_memfd?
> > >
> > > This seems fine for 4K page backing. But when TDX VMs have huge page
> > > backing, the vast majority of private memory memory wouldn't need PAMT
> > > allocation for 4K granularity.
> > >
> > > IIUC guest_memfd allocation happening at 2M granularity doesn't
> > > necessarily translate to 2M mapping in guest EPT entries. If the DPAMT
> > > support is to be properly utilized for huge page backings, there is a
> > > value in not attaching PAMT allocation with guest_memfd allocation.
> >
> > Right.
> >
> > It also requires special handling in many places in core-mm. Like, what
> > happens if THP in guest memfd got split. Who would allocate PAMT for it?
> > Migration will be more complicated too (when we get there).
>
> I actually went down this path too, but the problem I hit was that TDX module
> wants the PAMT page size to match the S-EPT page size. And the S-EPT size will
> depend on runtime behavior of the guest. I'm not sure why TDX module requires
> this though. Kirill, I'd be curious to understand the constraint more if you
> recall.

With DPAMT, when you pass a page pair to PAMT.ADD, the pages are stored in the
PAMT_2M entry. So the PAMT_2M entry cannot be used as a leaf entry anymore.
In theory, the TDX module could stash them somewhere else, like a generic
memory pool to be used for PAMT_4K when needed. But that is a significantly
different design from what we have now, with a different set of problems.

--
Kiryl Shutsemau / Kirill A. Shutemov
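Kirill's point about why the S-EPT and PAMT page sizes end up tied together can be visualized with a conceptual sketch. This is purely an editor's illustration of the constraint he describes; the actual TDX module metadata layout is not visible to the kernel and is not spelled out in this thread, and every field name below is invented:

#include <linux/types.h>

/*
 * Conceptual only: once Dynamic PAMT is in use, the PAMT_2M entry has to
 * reference the page pair supplied via TDH.PHYMEM.PAMT.ADD, so it can no
 * longer double as the leaf metadata for a 2M mapping of that region.
 */
struct pamt_2m_entry {
	union {
		/* 2M leaf: per-region metadata lives directly in the entry. */
		struct {
			u64 page_type  :  8;
			u64 owner_hkid : 16;
			u64 reserved   : 40;
		} leaf;
		/*
		 * Dynamic PAMT: the entry records the VMM-provided page pair
		 * that backs the 512 PAMT_4K leaves for this 2M region.
		 */
		struct {
			u64 pamt_page_pa[2];
		} pamt_4k_backing;
	};
};

As Kirill notes, decoupling the two would mean the TDX module stashing the page pair in some generic pool instead, which is a different design with its own set of problems.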
On Tue, Aug 12, 2025, Rick P Edgecombe wrote: > On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote: > > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT > > > > pages, on-demand PAMT management seems reasonable. But for PAMTs that > > > > are used to track guest-assigned memory, which is the vaaast majority > > > > of PAMT memory, why not hook guest_memfd? > > > > > > This seems fine for 4K page backing. But when TDX VMs have huge page > > > backing, the vast majority of private memory memory wouldn't need PAMT > > > allocation for 4K granularity. > > > > > > IIUC guest_memfd allocation happening at 2M granularity doesn't > > > necessarily translate to 2M mapping in guest EPT entries. If the DPAMT > > > support is to be properly utilized for huge page backings, there is a > > > value in not attaching PAMT allocation with guest_memfd allocation. I don't disagree, but the host needs to plan for the worst, especially since the guest can effectively dictate the max page size of S-EPT mappings. AFAIK, there are no plans to support memory overcommit for TDX guests, so unless a deployment wants to roll the dice and hope TDX guests will use hugepages for N% of their memory, the host will want to reserve 0.4% of guest memory for PAMTs to ensure it doesn't unintentionally DoS the guest with an OOM condition. Ditto for any use case that wants to support dirty logging (ugh), because dirty logging will require demoting all of guest memory to 4KiB mappings. > > Right. > > > > It also requires special handling in many places in core-mm. Like, what > > happens if THP in guest memfd got split. Who would allocate PAMT for it? guest_memfd? I don't see why core-mm would need to get involved. And I definitely don't see how handling page splits in guest_memfd would be more complicated than handling them in KVM's MMU. > > Migration will be more complicated too (when we get there). Which type of migration? Live migration or page migration? > I actually went down this path too, but the problem I hit was that TDX module > wants the PAMT page size to match the S-EPT page size. Right, but over-populating the PAMT would just result in "wasted" memory, correct? I.e. KVM can always provide more PAMT entries than are needed. Or am I misunderstanding how dynamic PAMT works? In other words, IMO, reclaiming PAMT pages on-demand is also a premature optimization of sorts, as it's not obvious to me that the host would actually be able to take advantage of the unused memory. > And the S-EPT size will depend on runtime behavior of the guest. I'm not sure > why TDX module requires this though. Kirill, I'd be curious to understand the > constraint more if you recall. > > But in any case, it seems there are multiple reasons.
On Tue, Aug 12, 2025 at 09:15:16AM -0700, Sean Christopherson wrote: > On Tue, Aug 12, 2025, Rick P Edgecombe wrote: > > On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote: > > > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT > > > > > pages, on-demand PAMT management seems reasonable. But for PAMTs that > > > > > are used to track guest-assigned memory, which is the vaaast majority > > > > > of PAMT memory, why not hook guest_memfd? > > > > > > > > This seems fine for 4K page backing. But when TDX VMs have huge page > > > > backing, the vast majority of private memory memory wouldn't need PAMT > > > > allocation for 4K granularity. > > > > > > > > IIUC guest_memfd allocation happening at 2M granularity doesn't > > > > necessarily translate to 2M mapping in guest EPT entries. If the DPAMT > > > > support is to be properly utilized for huge page backings, there is a > > > > value in not attaching PAMT allocation with guest_memfd allocation. > > I don't disagree, but the host needs to plan for the worst, especially since the > guest can effectively dictate the max page size of S-EPT mappings. AFAIK, there > are no plans to support memory overcommit for TDX guests, so unless a deployment > wants to roll the dice and hope TDX guests will use hugepages for N% of their > memory, the host will want to reserve 0.4% of guest memory for PAMTs to ensure > it doesn't unintentionally DoS the guest with an OOM condition. > > Ditto for any use case that wants to support dirty logging (ugh), because dirty > logging will require demoting all of guest memory to 4KiB mappings. > > > > Right. > > > > > > It also requires special handling in many places in core-mm. Like, what > > > happens if THP in guest memfd got split. Who would allocate PAMT for it? > > guest_memfd? I don't see why core-mm would need to get involved. And I definitely > don't see how handling page splits in guest_memfd would be more complicated than > handling them in KVM's MMU. > > > > Migration will be more complicated too (when we get there). > > Which type of migration? Live migration or page migration? Page migration. But I think after some reading, it can be manageable. -- Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Aug 12, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Aug 12, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT
> > > > > pages, on-demand PAMT management seems reasonable. But for PAMTs that
> > > > > are used to track guest-assigned memory, which is the vaaast majority
> > > > > of PAMT memory, why not hook guest_memfd?
> > > >
> > > > This seems fine for 4K page backing. But when TDX VMs have huge page
> > > > backing, the vast majority of private memory memory wouldn't need PAMT
> > > > allocation for 4K granularity.
> > > >
> > > > IIUC guest_memfd allocation happening at 2M granularity doesn't
> > > > necessarily translate to 2M mapping in guest EPT entries. If the DPAMT
> > > > support is to be properly utilized for huge page backings, there is a
> > > > value in not attaching PAMT allocation with guest_memfd allocation.
>
> I don't disagree, but the host needs to plan for the worst, especially since the
> guest can effectively dictate the max page size of S-EPT mappings. AFAIK, there
> are no plans to support memory overcommit for TDX guests, so unless a deployment
> wants to roll the dice and hope TDX guests will use hugepages for N% of their
> memory, the host will want to reserve 0.4% of guest memory for PAMTs to ensure
> it doesn't unintentionally DoS the guest with an OOM condition.

Reasonable guest VMs (e.g. Linux) will generally map things mostly at hugepage
granularity, so I don't think there is a reason here to be more conservative
and just increase the cost for the well-behaved guests. That being said, the
scenario of an unreasonable guest could be covered in the future by modifying
how PAMT allocation is accounted/charged. Guests are generally free to use the
lazy pvalidate/accept features, so the host can't guarantee the needed PAMT
memory will always be there anyway.
On Tue, 2025-08-12 at 09:15 -0700, Sean Christopherson wrote:
> > I actually went down this path too, but the problem I hit was that TDX
> > module wants the PAMT page size to match the S-EPT page size.
>
> Right, but over-populating the PAMT would just result in "wasted" memory,
> correct? I.e. KVM can always provide more PAMT entries than are needed. Or am
> I misunderstanding how dynamic PAMT works?

Demote needs DPAMT pages in order to split the DPAMT. But "needs" is what I was
hoping to understand better.

I do think, though, that we should weigh premature optimization against
re-architecting DPAMT only for the sake of a short-term KVM design. As in, if
fault-path-managed DPAMT is better for the whole lazy-accept way of things, it
probably makes more sense to just do it upfront with the existing architecture.

BTW, I think I untangled the fault path DPAMT page allocation code in this
series. I basically moved the existing external page cache allocation to
kvm/vmx/tdx.c. So the details of the top-up and the external page table cache
happen outside of the x86 MMU code. The top-up structure comes from the
arch/x86 side of the TDX code, so the cache can just be passed into
tdx_pamt_get(). And from the MMU code's perspective there is just one type:
"external page tables". It doesn't know about DPAMT at all.

So if that ends up acceptable, I think the main problem left is just this
global lock. And it seems we have a simple solution for it if needed.

>
> In other words, IMO, reclaiming PAMT pages on-demand is also a premature
> optimization of sorts, as it's not obvious to me that the host would actually
> be able to take advantage of the unused memory.

I was imagining some guestmemfd callback to set up DPAMT backing for all the
private memory. Just leave it when it's shared for simplicity. Then clean up
DPAMT when the pages are freed from guestmemfd. The control pages could have
their own path, like they do in this series. But it doesn't seem to be
supported.
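To make the reworked allocation flow described above concrete, here is a rough sketch of the shape it implies: a TDX-side cache that the vCPU tops up in a sleepable context and that the fault path hands to tdx_pamt_get(). Only the tdx_pamt_get() name comes from the thread; the cache type, the top-up helper, and both signatures are illustrative guesses rather than the actual code in the series:

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

struct tdx_pamt_cache {
	struct list_head pages;	/* spare 4K pages for TDH.PHYMEM.PAMT.ADD */
	int nr_pages;
};

/* Sleepable context, before mmu_lock is taken; @cache->pages must be initialized. */
int tdx_pamt_cache_topup(struct tdx_pamt_cache *cache, int min)
{
	while (cache->nr_pages < min) {
		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

		if (!page)
			return -ENOMEM;
		list_add(&page->lru, &cache->pages);
		cache->nr_pages++;
	}
	return 0;
}

/*
 * Fault path: ensure the 2M region covering @hpa has its PAMT page pair
 * installed, consuming two pages from @cache if this is the first user.
 */
int tdx_pamt_get(struct tdx_pamt_cache *cache, unsigned long hpa);

With this split, the generic MMU code only ever sees an opaque cache of "external page table" pages, matching the point above that it does not need to know about DPAMT at all.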
On Tue, Aug 12, 2025 at 11:39 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-08-12 at 09:15 -0700, Sean Christopherson wrote:
> > > I actually went down this path too, but the problem I hit was that TDX
> > > module wants the PAMT page size to match the S-EPT page size.
> >
> > Right, but over-populating the PAMT would just result in "wasted" memory,
> > correct? I.e. KVM can always provide more PAMT entries than are needed. Or am
> > I misunderstanding how dynamic PAMT works?
>
> Demote needs DPAMT pages in order to split the DPAMT. But "needs" is what I was
> hoping to understand better.
>
> I do think though, that we should consider premature optimization vs re-
> architecting DPAMT only for the sake of a short term KVM design. As in, if fault
> path managed DPAMT is better for the whole lazy accept way of things, it
> probably makes more sense to just do it upfront with the existing architecture.
>
> BTW, I think I untangled the fault path DPAMT page allocation code in this
> series. I basically moved the existing external page cache allocation to
> kvm/vmx/tdx.c. So the details of the top up and external page table cache
> happens outside of x86 mmu code. The top up structure comes from arch/x86 side
> of tdx code, so the cache can just be passed into tdx_pamt_get(). And from the
> MMU code's perspective there is just one type "external page tables". It doesn't
> know about DPAMT at all.
>
> So if that ends up acceptable, I think the main problem left is just this global
> lock. And it seems we have a simple solution for it if needed.
>
> >
> > In other words, IMO, reclaiming PAMT pages on-demand is also a premature
> > optimization of sorts, as it's not obvious to me that the host would actually
> > be able to take advantage of the unused memory.
>
> I was imagining some guestmemfd callback to setup DPAMT backing for all the
> private memory. Just leave it when it's shared for simplicity. Then cleanup
> DPAMT when the pages are freed from guestmemfd. The control pages could have
> their own path like it does in this series. But it doesn't seem supported.

IMO, tying the lifetime of guest_memfd folios with that of KVM ownership
beyond the memslot lifetime is leaking more state into guest_memfd
than needed. e.g. this will prevent use cases where guest_memfd needs
to be reused while handling reboot of a confidential VM [1].

IMO, if avoidable, it's better not to have DPAMT or generally other KVM
arch-specific state tracking hooked up to guest_memfd folios, especially
with hugepage support and whole folio splitting/merging that needs to
be done. If you still need it, guest_memfd should be as stateless as
possible, just like we are pushing for SNP preparation tracking [2]
to happen within KVM SNP, and IMO any such tracking should ideally be
cleaned up on memslot unbinding.

[1] https://lore.kernel.org/kvm/CAGtprH9NbCPSwZrQAUzFw=4rZPA60QBM2G8opYo9CZxRiYihzg@mail.gmail.com/
[2] https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/
On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote: > IMO, tieing lifetime of guest_memfd folios with that of KVM ownership > beyond the memslot lifetime is leaking more state into guest_memfd > than needed. e.g. This will prevent usecases where guest_memfd needs > to be reused while handling reboot of a confidential VM [1]. How does it prevent this? If you really want to re-use guest memory in a fast way then I think you would want the DPAMT to remain in place actually. It sounds like an argument to trigger the add/remove from guestmemfd actually. But I really question with all the work to rebuild S-EPT, and if you propose DPAMT too, how much work is really gained by not needing to reallocate hugetlbfs pages. Do you see how it could be surprising? I'm currently assuming there is some missing context. > > IMO, if avoidable, its better to not have DPAMT or generally other KVM > arch specific state tracking hooked up to guest memfd folios specially > with hugepage support and whole folio splitting/merging that needs to > be done. If you still need it, guest_memfd should be stateless as much > as possible just like we are pushing for SNP preparation tracking [2] > to happen within KVM SNP and IMO any such tracking should ideally be > cleaned up on memslot unbinding. I'm not sure gmemfd would need to track state. It could be a callback. But it may be academic anyway. Below... > > [1] https://lore.kernel.org/kvm/CAGtprH9NbCPSwZrQAUzFw=4rZPA60QBM2G8opYo9CZxRiYihzg@mail.gmail.com/ > [2] https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/ Looking into that more, from the code it seems it's not quite so straightforward. Demote will always require new DPAMT pages to be passed, and promote will always remove the 4k DPAMT entries and pass them back to the host. But on early reading, 2MB PAGE.AUG looks like it can handle the DPAMT being mapped at 4k. So maybe there is some wiggle room there? But before I dig further, I think I've heard 4 possible arguments for keeping the existing design: 1. TDX module may require or at least push the caller to have S-EPT match DPAMT size (confirmation TBD) 2. Mapping DPAMT all at 4k requires extra memory for TD huge pages 3. It *may* slow TD boots because things can't be lazily installed via the fault path. (testing not done) 4. While the global lock is bad, there is an easy fix for that if it is needed. It seems Vishal cares a lot about (2). So I'm wondering if we need to keep going down this path. In the meantime, I'm going to try to get some better data on the global lock contention (Sean's question about how much of the memory was actually faulted in).
On Tue, Aug 12, 2025 at 4:35 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote:
> > IMO, tieing lifetime of guest_memfd folios with that of KVM ownership
> > beyond the memslot lifetime is leaking more state into guest_memfd
> > than needed. e.g. This will prevent usecases where guest_memfd needs
> > to be reused while handling reboot of a confidential VM [1].
>
> How does it prevent this? If you really want to re-use guest memory in a fast
> way then I think you would want the DPAMT to remain in place actually. It sounds
> like an argument to trigger the add/remove from guestmemfd actually.

With the reboot use case, a new VM may start with its own new HKID, so I don't
think we can preserve any state that's specific to the previous instance. We
can only reduce the amount of state to be maintained in S-EPTs/DPAMTs by using
hugepages wherever possible.

> But I really question with all the work to rebuild S-EPT, and if you propose
> DPAMT too, how much work is really gained by not needing to reallocate hugetlbfs
> pages. Do you see how it could be surprising? I'm currently assuming there is
> some missing context.

I would not limit the reboot use case to just the hugepage backing scenario.
guest_memfd folios (and ideally the guest_memfd files themselves) should simply
be reusable outside the VM lifecycle, irrespective of whether they're used to
back CoCo VMs or not.
On Tue, 2025-08-12 at 17:18 -0700, Vishal Annapurve wrote: > On Tue, Aug 12, 2025 at 4:35 PM Edgecombe, Rick P > <rick.p.edgecombe@intel.com> wrote: > > > > On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote: > > > IMO, tieing lifetime of guest_memfd folios with that of KVM ownership > > > beyond the memslot lifetime is leaking more state into guest_memfd > > > than needed. e.g. This will prevent usecases where guest_memfd needs > > > to be reused while handling reboot of a confidential VM [1]. > > > > How does it prevent this? If you really want to re-use guest memory in a fast > > way then I think you would want the DPAMT to remain in place actually. It sounds > > like an argument to trigger the add/remove from guestmemfd actually. > > With the reboot usecase, a new VM may start with it's own new HKID so > I don't think we can preserve any state that's specific to the > previous instance. We can only reduce the amount of state to be > maintained in SEPTs/DPAMTs by using hugepages wherever possible. This is saying that the S-EPT can't be preserved, which doesn't really help me understand why page allocations are such a big source of the work. I guess you are saying it's the only thing possible to do? Hmm, I'm not sure why the S-EPT couldn't be preserved, especially when you allow for changes to KVM or the TDX module. But if we are trying to solve the problem of making TD reboot faster, let's figure out the biggest things that are making it slow first and work on that. Like it's missing a lot of context on why this turned out to be the right optimization to do. Disclaimer: This optimization may be great for other types of VMs and that is all well and good, but the points are about TDX here and the justification of the TD reboot optimization is relevant to how we implement DPAMT. > > > > > But I really question with all the work to rebuild S-EPT, and if you propose > > DPAMT too, how much work is really gained by not needing to reallocate hugetlbfs > > pages. Do you see how it could be surprising? I'm currently assuming there is > > some missing context. > > I would not limit the reboot usecase to just hugepage backing > scenario. guest_memfd folios (and ideally the guest_memfd files > themselves) simply should be reusable outside the VM lifecycle > irrespective of whether it's used to back CoCo VMs or not. Still surprised that host page allocations turned out to be the biggest thing sticking out.
On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote: > This patchset enables Dynamic PAMT in TDX. Please review. > > Previously, we thought it can get upstreamed after huge page support, but > huge pages require support on guestmemfd side which might take time to hit > upstream. Dynamic PAMT doesn't have dependencies. Did you run this through the latest TDX selftests? Specifically Reinette's WIP MMU stress test would be real good to test on this.
On Wed, Jun 25, 2025 at 10:49:16PM +0000, Edgecombe, Rick P wrote: > On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote: > > This patchset enables Dynamic PAMT in TDX. Please review. > > > > Previously, we thought it can get upstreamed after huge page support, but > > huge pages require support on guestmemfd side which might take time to hit > > upstream. Dynamic PAMT doesn't have dependencies. > > Did you run this through the latest TDX selftests? Specifically Reinette's WIP > MMU stress test would be real good to test on this. I didn't. Will do. -- Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Jun 09, 2025 at 10:13:28PM +0300, Kirill A. Shutemov wrote:
> This patchset enables Dynamic PAMT in TDX. Please review.

Gentle ping? The patchset is crucial for getting TDX adopted. The upfront
memory cost of enabling TDX on a machine is too high, especially if you don't
know in advance whether the machine will run TDX guests.

--
Kiryl Shutsemau / Kirill A. Shutemov