This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
applied, but rather to receive early feedback on the feature design and
enabling.
From our perspective, this feature is a lower priority than huge page
support. I will rebase this patchset on top of Yan's huge page enabling
series at a later time, as it requires additional work.
Any feedback is welcome. We are open to ideas.
=========================================================================
The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.
The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.
The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.
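
For scale, straightforward arithmetic on those percentages: on a host
with 1 TiB of memory, static PAMT allocations drop from roughly 4 GiB
(0.4%) to roughly 40 MiB (0.004%).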
PAMT memory is dynamically allocated as pages gain TDX protections.
It is reclaimed when TDX protections have been removed from all
pages in a contiguous area.
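
To make that lifecycle concrete, here is a minimal sketch of the
per-2M-region refcounting. The tdx_pamt_get()/tdx_pamt_put() names come
from the patch titles below; the internals, the SEAMCALL wrapper
signatures, and the (elided) locking are illustrative rather than the
series' actual code:

	/* One counter per 2M region, sized at TDX module init. */
	static atomic_t *pamt_refcounts;

	/* Assumed wrappers for the new SEAMCALLs (signatures illustrative). */
	int tdh_phymem_pamt_add(unsigned long hpa_2m, struct page *pamt_pages);
	int tdh_phymem_pamt_remove(unsigned long hpa_2m);

	static int tdx_pamt_get(struct page *page)
	{
		unsigned long hpa = page_to_phys(page) & PMD_MASK;
		atomic_t *cnt = &pamt_refcounts[hpa >> PMD_SHIFT];
		struct page *pamt_pages;
		int ret;

		/*
		 * Simplified: the real code must close the window between
		 * the 0->1 transition and TDH.PHYMEM.PAMT.ADD completing.
		 */
		if (atomic_inc_return(cnt) > 1)
			return 0;	/* PAMT_4K already backs this 2M region */

		/*
		 * 0->1: hand an 8K page pair to the TDX module. Order-1 is
		 * a simplification; the two pages are not required to be
		 * contiguous.
		 */
		pamt_pages = alloc_pages(GFP_KERNEL, 1);
		if (!pamt_pages) {
			atomic_dec(cnt);
			return -ENOMEM;
		}

		ret = tdh_phymem_pamt_add(hpa, pamt_pages);
		if (ret) {
			__free_pages(pamt_pages, 1);
			atomic_dec(cnt);
		}
		return ret;
	}

	static void tdx_pamt_put(struct page *page)
	{
		unsigned long hpa = page_to_phys(page) & PMD_MASK;
		atomic_t *cnt = &pamt_refcounts[hpa >> PMD_SHIFT];

		/* 1->0: no TDX-protected pages left in this 2M region. */
		if (atomic_dec_and_test(cnt))
			tdh_phymem_pamt_remove(hpa);	/* page freeing elided */
	}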
TODO:
- Rebase on top of Yan's huge page support series. Demotion requires
additional handling with Dynamic PAMT;
- Get better vmalloc API from core-mm and simplify patch 02/12.
Kirill A. Shutemov (12):
x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
x86/virt/tdx: Allocate reference counters for PAMT memory
x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
x86/virt/tdx: Account PAMT memory and print it in /proc/meminfo
KVM: TDX: Add tdx_pamt_get()/put() helpers
KVM: TDX: Allocate PAMT memory in __tdx_td_init()
KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
KVM: TDX: Preallocate PAMT pages to be used in page fault path
KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops
KVM: TDX: Reclaim PAMT memory
x86/virt/tdx: Enable Dynamic PAMT
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/include/asm/set_memory.h | 2 +
arch/x86/include/asm/tdx.h | 22 ++
arch/x86/include/asm/tdx_global_metadata.h | 1 +
arch/x86/kvm/mmu/mmu.c | 10 +
arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++-
arch/x86/kvm/vmx/main.c | 2 +
arch/x86/kvm/vmx/tdx.c | 215 ++++++++++++++++++--
arch/x86/kvm/vmx/tdx_errno.h | 1 +
arch/x86/kvm/vmx/x86_ops.h | 9 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/meminfo.c | 11 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/virt/vmx/tdx/tdx.c | 211 ++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 5 +-
arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 3 +
virt/kvm/kvm_main.c | 1 +
18 files changed, 522 insertions(+), 29 deletions(-)
create mode 100644 arch/x86/mm/meminfo.c
--
2.47.2
On Fri, 2 May 2025 16:08:16 +0300 "Kirill A. Shutemov"
<kirill.shutemov@linux.intel.com> wrote:

> This RFC patchset enables Dynamic PAMT in TDX. It is not intended to
> be applied, but rather to receive early feedback on the feature
> design and enabling.
>
> From our perspective, this feature is a lower priority than huge
> page support. I will rebase this patchset on top of Yan's huge page
> enabling series at a later time, as it requires additional work.
>
> Any feedback is welcome. We are open to ideas.

Do we have any estimate of how much extra cost tightly coupling the
PAMT allocation/de-allocation to private page allocation/de-allocation
adds to TVM creation/destruction?

It has become common for platforms to require meta pages to be handed
to the TSM when operating on private pages. When the pool of meta pages
is extensible/shrinkable, there are always ideas about batch
pre-charging the pool and lazily batch-reclaiming it at certain paths,
for performance considerations or VM characteristics. That might turn
into a vendor-agnostic path in KVM with tunable configurations.

Z.
On Wed, May 14, 2025 at 11:33:17PM +0300, Zhi Wang wrote:
> Do we have any estimate of how much extra cost tightly coupling the
> PAMT allocation/de-allocation to private page allocation/de-allocation
> adds to TVM creation/destruction? It has become common for platforms
> to require meta pages to be handed to the TSM when operating on
> private pages. When the pool of meta pages is extensible/shrinkable,
> there are always ideas about batch pre-charging the pool and lazily
> batch-reclaiming it at certain paths, for performance considerations
> or VM characteristics. That might turn into a vendor-agnostic path in
> KVM with tunable configurations.

It depends on the pages that the page allocator gives to the TD. If
memory is not fragmented and the TD receives memory from the same 2M
chunks, we do not need much PAMT memory and we do not need to make
additional SEAMCALLs to add it. It also depends on the availability of
huge pages.

From my tests, a typical TD boot takes about 20 MiB of PAMT memory if
no huge pages are allocated and about 2 MiB with huge pages. The
overhead of managing it is negligible, especially with huge pages:
approximately 256 SEAMCALLs to add PAMT pages and the same number to
remove them.

The consumption of PAMT memory for booting does not increase
significantly with the size of the TD, as the guest accepts memory
lazily. However, it will increase as more memory is accepted if huge
pages are not used.

I don't think we can justify any batching here.

-- 
Kiryl Shutsemau / Kirill A. Shutemov
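
(A quick consistency check on these figures, assuming one 8 KiB PAMT
page pair per 2 MiB region as described later in the thread: 2 MiB of
PAMT memory is 2 MiB / 8 KiB = 256 page pairs, which matches the ~256
TDH.PHYMEM.PAMT.ADD calls quoted above. The 20 MiB no-huge-page case
works out to ~2560 pairs, i.e. PAMT_4K coverage for ~2560 distinct 2 MiB
regions, or about 5 GiB of scattered guest memory.)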
On 5/15/25 02:17, Kirill A. Shutemov wrote:
> I don't think we can justify any batching here.

There is one primary goal here: reduce TDX overhead when not running
TDX guests.

It has the side effect of being _able_ to reduce the amount of memory
that TDX guests use, but only when using >=2M pages. It has the
theoretical capability to do the same for 4k users, but only when the
pages are quite contiguous. Right?

The "not running TDX guests" and ">=2M pages" benefits are relatively
easy. The 4k one is hard and is going to take a lot more work. Could we
please focus on the easy ones for now and not get distracted by the
hard one that might not even be worth it in the end?
On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> applied, but rather to receive early feedback on the feature design and
> enabling.

In that case, please describe the design, and specifically *why* you
chose this particular design, along with the constraints and rules of
dynamic PAMTs that led to that decision. It would also be very helpful
to know what options you considered and discarded, so that others don't
waste time coming up with solutions that you already rejected.
On Wed, May 14, 2025 at 06:41:10AM -0700, Sean Christopherson wrote:
> In that case, please describe the design, and specifically *why* you
> chose this particular design, along with the constraints and rules of
> dynamic PAMTs that led to that decision. It would also be very helpful
> to know what options you considered and discarded, so that others
> don't waste time coming up with solutions that you already rejected.

Dynamic PAMT support in the TDX module
======================================

Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K
as needed. PAMT_1G and PAMT_2M are still allocated statically at the
time of TDX module initialization. At the init stage, the allocation of
PAMT_4K is replaced with PAMT_PAGE_BITMAP, which currently requires one
bit of memory per 4k page.

The VMM is responsible for allocating and freeing PAMT_4K. There's a
pair of new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and
TDH.PHYMEM.PAMT.REMOVE. They add/remove PAMT memory in the form of a
page pair. There's no requirement for these pages to be contiguous.

The page pair supplied via TDH.PHYMEM.PAMT.ADD covers a specified 2M
region. It allows any 4K page from the region to be usable by the TDX
module.

With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing
PAMT memory (TDX_MISSING_PAMT_PAGE_PAIR):

 - TDH.MNG.CREATE
 - TDH.MNG.ADDCX
 - TDH.VP.ADDCX
 - TDH.VP.CREATE
 - TDH.MEM.PAGE.ADD
 - TDH.MEM.PAGE.AUG
 - TDH.MEM.PAGE.DEMOTE
 - TDH.MEM.PAGE.RELOCATE

Basically, if you supply memory to a TD, this memory has to be backed
by PAMT memory.

Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.

The TDX module tracks PAMT memory usage and can give the VMM a hint
that PAMT memory can be removed. Such a hint is provided by all
SEAMCALLs that remove memory from a TD:

 - TDH.MEM.SEPT.REMOVE
 - TDH.MEM.PAGE.REMOVE
 - TDH.MEM.PAGE.PROMOTE
 - TDH.MEM.PAGE.RELOCATE
 - TDH.PHYMEM.PAGE.RECLAIM

With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as
additional input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE
returns a no-longer-needed PAMT page pair.

PAMT memory is a global resource, not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical
address. Each entry in the tree can be locked with a shared or
exclusive lock. Any modification of the tree requires an exclusive
lock.

Any SEAMCALL that takes an explicit HPA as an argument will walk the
tree, taking shared locks on entries. This is required to make sure
that the page pointed to by the HPA is of a compatible type for the
usage. TDCALLs don't take PAMT locks, as none of them take an HPA as an
argument.

Dynamic PAMT enabling in the kernel
===================================

The kernel maintains a refcount for every 2M region with two helpers,
tdx_pamt_get() and tdx_pamt_put(). The refcount represents the number
of users of the PAMT memory in the region. The kernel calls
TDH.PHYMEM.PAMT.ADD on the 0->1 transition and TDH.PHYMEM.PAMT.REMOVE
on the 1->0 transition.

PAMT memory gets allocated as part of TD init, VCPU init, on populating
the SEPT tree and on adding guest memory (both during TD build and via
AUG on accept). PAMT memory is removed on reclaim of control pages and
guest memory.
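
To make the above concrete, a call site that adds guest memory might be
bracketed by the helpers roughly like this (an illustrative sketch, not
the actual patches: tdx_aug_page() is a made-up name, the
tdh_mem_page_aug() wrapper signature is assumed, and error handling is
simplified):

	/* Stand-in declaration for the SEAMCALL wrapper; signature assumed. */
	int tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gfn_t gfn, struct page *page);

	static int tdx_aug_page(struct kvm_tdx *kvm_tdx, gfn_t gfn,
				struct page *page)
	{
		int ret;

		/*
		 * Take a reference on PAMT memory for the 2M region that
		 * contains the page. On the 0->1 transition this issues
		 * TDH.PHYMEM.PAMT.ADD, so the AUG below cannot fail with
		 * TDX_MISSING_PAMT_PAGE_PAIR.
		 */
		ret = tdx_pamt_get(page);
		if (ret)
			return ret;

		ret = tdh_mem_page_aug(kvm_tdx, gfn, page);
		if (ret)
			tdx_pamt_put(page);	/* drop the reference on failure */

		return ret;
	}

On the teardown side, once TDH.MEM.PAGE.REMOVE (or control page
reclaim) succeeds, tdx_pamt_put() drops the reference and issues
TDH.PHYMEM.PAMT.REMOVE when the count for the 2M region hits zero.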
Populating PAMT memory in the fault path is tricky, as we cannot
allocate memory from the context where it is needed. I introduced a
pair of kvm_x86_ops to allocate PAMT memory from a per-VCPU pool, from
a context where the VCPU is still around, and to free it on failure.
This flow will likely be reworked in the next versions.

Previous attempt at Dynamic PAMT enabling
=========================================

My initial kernel enabling attempt was quite different. I wanted to
make PAMT allocation lazy: only try to add a PAMT page pair if a
SEAMCALL fails due to missing PAMT, and reclaim it back based on the
hint provided by the TDX module. The motivation was to avoid
duplicating, on the kernel side, the PAMT memory refcounting that the
TDX module does.

This approach is inherently more racy, as we don't serialize PAMT
memory add/remove against SEAMCALLs that add/remove memory for a TD.
Such serialization would require global locking, which is a no-go.

I made this approach work, but at some point I realized that it cannot
be robust as long as we want to avoid TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will pop up as a result of the races I mentioned
above.

I gave up on this approach and went with the current one, which uses
explicit refcounting.

Brain dumped. Let me know if anything is unclear.

-- 
Kiryl Shutsemau / Kirill A. Shutemov
On 5/15/25 07:22, Kirill A. Shutemov wrote:
> The VMM is responsible for allocating and freeing PAMT_4K. There's a
> pair of new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and
> TDH.PHYMEM.PAMT.REMOVE. They add/remove PAMT memory in the form of a
> page pair. There's no requirement for these pages to be contiguous.

BTW, that second sentence is a little goofy. Is it talking about
ADD/REMOVE being a matched pair? Or that there needs to be 8k of
metadata storage provided to each ADD/REMOVE call?

One thing I've noticed in writing changelogs and so forth is that
repetition can hurt understanding if the concepts aren't the same. Like
saying there is a "pair" of calls and a "pair" of pages when the fact
that both are pairs is a coincidence rather than an intentional and
important part of the design.
On Thu, May 15, 2025 at 08:03:28AM -0700, Dave Hansen wrote:
> On 5/15/25 07:22, Kirill A. Shutemov wrote:
> > The VMM is responsible for allocating and freeing PAMT_4K. There's a
> > pair of new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and
> > TDH.PHYMEM.PAMT.REMOVE. They add/remove PAMT memory in the form of a
> > page pair. There's no requirement for these pages to be contiguous.
>
> BTW, that second sentence is a little goofy. Is it talking about
> ADD/REMOVE being a matched pair? Or that there needs to be 8k of
> metadata storage provided to each ADD/REMOVE call?

Both :P The pair of SEAMCALLs operates on pairs of pages.

> One thing I've noticed in writing changelogs and so forth is that
> repetition can hurt understanding if the concepts aren't the same.
> Like saying there is a "pair" of calls and a "pair" of pages when the
> fact that both are pairs is a coincidence rather than an intentional
> and important part of the design.

Yeah, I see it. I will try to avoid "pair" for SEAMCALLs in the Dynamic
PAMT context. Maybe that will clear up the confusion.

-- 
Kiryl Shutsemau / Kirill A. Shutemov