[RFC, PATCH 00/12] TDX: Enable Dynamic PAMT

Kirill A. Shutemov posted 12 patches 9 months, 1 week ago
There is a newer version of this series
[RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Kirill A. Shutemov 9 months, 1 week ago
This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
applied, but rather to receive early feedback on the feature design and
enabling.

From our perspective, this feature has a lower priority compared to huge
page support. I will rebase this patchset on top of Yan's huge page
enabling at a later time, as it requires additional work.

Any feedback is welcome. We are open to ideas.

=========================================================================

The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.

The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.

The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.

PAMT memory is dynamically allocated as pages gain TDX protections.
It is reclaimed when TDX protections have been removed from all
pages in a contiguous area.

TODO:
  - Rebase on top of Yan's huge page support series. Demotion requires
    additional handling with Dynamic PAMT;
  - Get better vmalloc API from core-mm and simplify patch 02/12.

Kirill A. Shutemov (12):
  x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
  x86/virt/tdx: Allocate reference counters for PAMT memory
  x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
  x86/virt/tdx: Account PAMT memory and print if in /proc/meminfo
  KVM: TDX: Add tdx_pamt_get()/put() helpers
  KVM: TDX: Allocate PAMT memory in __tdx_td_init()
  KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
  KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
  KVM: TDX: Preallocate PAMT pages to be used in page fault path
  KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops
  KVM: TDX: Reclaim PAMT memory
  x86/virt/tdx: Enable Dynamic PAMT

 arch/x86/include/asm/kvm-x86-ops.h          |   2 +
 arch/x86/include/asm/kvm_host.h             |   5 +
 arch/x86/include/asm/set_memory.h           |   2 +
 arch/x86/include/asm/tdx.h                  |  22 ++
 arch/x86/include/asm/tdx_global_metadata.h  |   1 +
 arch/x86/kvm/mmu/mmu.c                      |  10 +
 arch/x86/kvm/mmu/tdp_mmu.c                  |  47 ++++-
 arch/x86/kvm/vmx/main.c                     |   2 +
 arch/x86/kvm/vmx/tdx.c                      | 215 ++++++++++++++++++--
 arch/x86/kvm/vmx/tdx_errno.h                |   1 +
 arch/x86/kvm/vmx/x86_ops.h                  |   9 +
 arch/x86/mm/Makefile                        |   2 +
 arch/x86/mm/meminfo.c                       |  11 +
 arch/x86/mm/pat/set_memory.c                |   2 +-
 arch/x86/virt/vmx/tdx/tdx.c                 | 211 ++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h                 |   5 +-
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c |   3 +
 virt/kvm/kvm_main.c                         |   1 +
 18 files changed, 522 insertions(+), 29 deletions(-)
 create mode 100644 arch/x86/mm/meminfo.c

-- 
2.47.2
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Zhi Wang 8 months, 4 weeks ago
On Fri,  2 May 2025 16:08:16 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> This RFC patchset enables Dynamic PAMT in TDX. It is not intended to
> be applied, but rather to receive early feedback on the feature
> design and enabling.
> 
> From our perspective, this feature has a lower priority compared to
> huge page support. I will rebase this patchset on top of Yan's huge
> page enabling at a later time, as it requires additional work.
> 
> Any feedback is welcome. We are open to ideas.
> 

Do we have any estimate of how much extra cost is added to TVM
creation/destruction when PAMT allocation/de-allocation is tightly
coupled to private page allocation/de-allocation? It has become common
on many platforms that meta pages must be given to the TSM when
operating on private pages. When the pool of meta pages is
extensible/shrinkable, there are always ideas about batch pre-charging
the pool and lazily batch-reclaiming it on a certain path, for
performance reasons or to suit VM characteristics. That might turn into
a vendor-agnostic path in KVM with tunable configurations.

Z.

> =========================================================================
> 
> The Physical Address Metadata Table (PAMT) holds TDX metadata for
> physical memory and must be allocated by the kernel during TDX module
> initialization.
> 
> The exact size of the required PAMT memory is determined by the TDX
> module and may vary between TDX module versions, but currently it is
> approximately 0.4% of the system memory. This is a significant
> commitment, especially if it is not known upfront whether the machine
> will run any TDX guests.
> 
> The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
> PAMT_2M levels are still allocated on TDX module initialization, but
> the PAMT_4K level is allocated dynamically, reducing static
> allocations to approximately 0.004% of the system memory.
> 
> PAMT memory is dynamically allocated as pages gain TDX protections.
> It is reclaimed when TDX protections have been removed from all
> pages in a contiguous area.
> 
> TODO:
>   - Rebase on top of Yan's huge page support series. Demotion requires
>     additional handling with Dynamic PAMT;
>   - Get better vmalloc API from core-mm and simplify patch 02/12.
> 
> Kirill A. Shutemov (12):
>   x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
>   x86/virt/tdx: Allocate reference counters for PAMT memory
>   x86/virt/tdx: Add wrappers for TDH.PHYMEM.PAMT.ADD/REMOVE
>   x86/virt/tdx: Account PAMT memory and print if in /proc/meminfo
>   KVM: TDX: Add tdx_pamt_get()/put() helpers
>   KVM: TDX: Allocate PAMT memory in __tdx_td_init()
>   KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
>   KVM: x86/tdp_mmu: Add phys_prepare() and phys_cleanup() to kvm_x86_ops
>   KVM: TDX: Preallocate PAMT pages to be used in page fault path
>   KVM: TDX: Hookup phys_prepare() and phys_cleanup() kvm_x86_ops
>   KVM: TDX: Reclaim PAMT memory
>   x86/virt/tdx: Enable Dynamic PAMT
> 
>  arch/x86/include/asm/kvm-x86-ops.h          |   2 +
>  arch/x86/include/asm/kvm_host.h             |   5 +
>  arch/x86/include/asm/set_memory.h           |   2 +
>  arch/x86/include/asm/tdx.h                  |  22 ++
>  arch/x86/include/asm/tdx_global_metadata.h  |   1 +
>  arch/x86/kvm/mmu/mmu.c                      |  10 +
>  arch/x86/kvm/mmu/tdp_mmu.c                  |  47 ++++-
>  arch/x86/kvm/vmx/main.c                     |   2 +
>  arch/x86/kvm/vmx/tdx.c                      | 215 ++++++++++++++++++--
>  arch/x86/kvm/vmx/tdx_errno.h                |   1 +
>  arch/x86/kvm/vmx/x86_ops.h                  |   9 +
>  arch/x86/mm/Makefile                        |   2 +
>  arch/x86/mm/meminfo.c                       |  11 +
>  arch/x86/mm/pat/set_memory.c                |   2 +-
>  arch/x86/virt/vmx/tdx/tdx.c                 | 211 ++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h                 |   5 +-
>  arch/x86/virt/vmx/tdx/tdx_global_metadata.c |   3 +
>  virt/kvm/kvm_main.c                         |   1 +
>  18 files changed, 522 insertions(+), 29 deletions(-)
>  create mode 100644 arch/x86/mm/meminfo.c
>
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Kirill A. Shutemov 8 months, 4 weeks ago
On Wed, May 14, 2025 at 11:33:17PM +0300, Zhi Wang wrote:
> On Fri,  2 May 2025 16:08:16 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > This RFC patchset enables Dynamic PAMT in TDX. It is not intended to
> > be applied, but rather to receive early feedback on the feature
> > design and enabling.
> > 
> > From our perspective, this feature has a lower priority compared to
> > huge page support. I will rebase this patchset on top of Yan's huge
> > page enabling at a later time, as it requires additional work.
> > 
> > Any feedback is welcome. We are open to ideas.
> > 
> 
> Do we have any estimation on how much extra cost on TVM creation/destroy
> when tightly couple the PAMT allocation/de-allocation to the private
> page allocation/de-allocation? It has been trendy nowadays that
> meta pages are required to be given to the TSM when doing stuff with
> private page in many platforms. When the pool of the meta page is
> extensible/shrinkable, there are always ideas about batch pre-charge the
> pool and lazy batch reclaim it at a certain path for performance
> considerations or VM characteristics. That might turn into a
> vendor-agnostic path in KVM with tunable configurations.

It depends on the pages that the page allocator gives to TD. If memory is
not fragmented and TD receives memory from the same 2M chunks, we do not
need much PAMT memory and we do not need to make additional SEAMCALLs to
add it. It also depends on the availability of huge pages.

From my tests, a typical TD boot takes about 20 MiB of PAMT memory if no
huge pages are allocated and about 2 MiB with huge pages. The overhead of
managing it is negligible, especially with huge pages: approximately
256 SEAMCALLs to add PAMT pages and the same number to remove them.
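
As a rough cross-check of these numbers, the sketch below counts the ADD
calls needed for a given amount of PAMT memory. It assumes that each
TDH.PHYMEM.PAMT.ADD call hands over exactly one pair of 4 KiB pages; the
macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Assumption: one ADD call supplies one pair of 4 KiB pages (8 KiB). */
#define PAMT_PAIR_BYTES (2 * 4 * KIB)

/* Number of ADD calls needed to hand over pamt_bytes of PAMT memory. */
static uint64_t pamt_add_calls(uint64_t pamt_bytes)
{
	/* PAMT memory is only ever supplied in whole page pairs. */
	assert(pamt_bytes % PAMT_PAIR_BYTES == 0);
	return pamt_bytes / PAMT_PAIR_BYTES;
}
```

Under that assumption, 2 MiB of PAMT memory corresponds to 256 calls, which
matches the huge page figure above, and 20 MiB to 2560 calls.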

The consumption of PAMT memory for booting does not increase significantly
with the size of TD as the guest accepts memory lazily. However, it will
increase as more memory is accepted if huge pages are not used.

I don't think we can justify any batching here.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Dave Hansen 8 months, 3 weeks ago
On 5/15/25 02:17, Kirill A. Shutemov wrote:
> I don't think we can justify any batching here.

There is one primary goal here:

	Reduce TDX overhead when not running TDX guests.

It has the side-effect of being _able_ to reduce the amount of memory
that TDX guests use when using >=2M pages only. It has the theoretical
capability to do the same for 4k users but only when the pages are quite
contiguous.

Right?

The "not running TDX guest" and ">=2M pages" benefits are relatively
easy. The 4k one is hard and is going to take a lot more work.

Could we please focus on the easy one for now and not get distracted by
the hard one that might not even be worth it in the end?
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Sean Christopherson 8 months, 4 weeks ago
On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> applied, but rather to receive early feedback on the feature design and
> enabling.

In that case, please describe the design, and specifically *why* you chose this
particular design, along with the constraints and rules of dynamic PAMTs that
led to that decision.  It would also be very helpful to know what options you
considered and discarded, so that others don't waste time coming up with solutions
that you already rejected.

> From our perspective, this feature has a lower priority compared to huge
> page support. I will rebase this patchset on top of Yan's huge page
> enabling at a later time, as it requires additional work.
> 
> Any feedback is welcome. We are open to ideas.
> 
> =========================================================================
> 
> The Physical Address Metadata Table (PAMT) holds TDX metadata for
> physical memory and must be allocated by the kernel during TDX module
> initialization.
> 
> The exact size of the required PAMT memory is determined by the TDX
> module and may vary between TDX module versions, but currently it is
> approximately 0.4% of the system memory. This is a significant
> commitment, especially if it is not known upfront whether the machine
> will run any TDX guests.
> 
> The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
> PAMT_2M levels are still allocated on TDX module initialization, but the
> PAMT_4K level is allocated dynamically, reducing static allocations to
> approximately 0.004% of the system memory.
> 
> PAMT memory is dynamically allocated as pages gain TDX protections.
> It is reclaimed when TDX protections have been removed from all
> pages in a contiguous area.
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Kirill A. Shutemov 8 months, 3 weeks ago
On Wed, May 14, 2025 at 06:41:10AM -0700, Sean Christopherson wrote:
> On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> > This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> > applied, but rather to receive early feedback on the feature design and
> > enabling.
> 
> In that case, please describe the design, and specifically *why* you chose this
> particular design, along with the constraints and rules of dynamic PAMTs that
> led to that decision.  It would also be very helpful to know what options you
> considered and discarded, so that others don't waste time coming up with solutions
> that you already rejected.

Dynamic PAMT support in TDX module
==================================

Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of
TDX module initialization. At the init stage, the allocation of PAMT_4K is
replaced with PAMT_PAGE_BITMAP, which currently requires one bit of memory
per 4k.

VMM is responsible for allocating and freeing PAMT_4K. There's a pair of
new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. They
add/remove PAMT memory in form of page pair. There's no requirement for
these pages to be contiguous.

The page pair supplied via TDH.PHYMEM.PAMT.ADD covers the specified 2M
region. It allows any 4K page from the region to be usable by the TDX module.
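
A back-of-the-envelope sketch of where the percentages in the cover letter
plausibly come from, assuming 4 KiB pages, one 8 KiB page pair per 2M
region and one bitmap bit per 4K page. The function names are hypothetical,
and the PAMT_1G and PAMT_2M levels are not modelled because their entry
sizes are not stated in this thread.

```c
#include <assert.h>

#define PAGE_4K_BYTES   4096.0
#define REGION_2M_BYTES (2.0 * 1024 * 1024)

/* PAMT_4K cost as a percentage of the memory it covers: one 8 KiB pair per 2M. */
static double pamt_4k_overhead_pct(void)
{
	return 100.0 * (2 * PAGE_4K_BYTES) / REGION_2M_BYTES;
}

/* PAMT_PAGE_BITMAP cost as a percentage: one bit per 4 KiB page. */
static double pamt_bitmap_overhead_pct(void)
{
	return 100.0 / (8 * PAGE_4K_BYTES);
}
```

This gives about 0.39% for a fully populated PAMT_4K level and about 0.003%
for the bitmap, in line with the approximate 0.4% and 0.004% figures; the
remainder of the static figure presumably comes from the PAMT_2M and
PAMT_1G levels.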

With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):

 - TDH.MNG.CREATE
 - TDH.MNG.ADDCX 
 - TDH.VP.ADDCX
 - TDH.VP.CREATE
 - TDH.MEM.PAGE.ADD
 - TDH.MEM.PAGE.AUG 
 - TDH.MEM.PAGE.DEMOTE
 - TDH.MEM.PAGE.RELOCATE

Basically, if you supply memory to a TD, this memory has to be backed by
PAMT memory.

Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.

The TDX module tracks PAMT memory usage and can give the VMM a hint that
PAMT memory can be removed. Such a hint is provided by all SEAMCALLs that
remove memory from a TD:

 - TDH.MEM.SEPT.REMOVE
 - TDH.MEM.PAGE.REMOVE
 - TDH.MEM.PAGE.PROMOTE
 - TDH.MEM.PAGE.RELOCATE
 - TDH.PHYMEM.PAGE.RECLAIM

With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as an
additional input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE returns
the PAMT page pair that is no longer needed.

PAMT memory is a global resource and is not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical address.
Each entry in the tree can be locked with a shared or an exclusive lock. Any
modification of the tree requires the exclusive lock.

Any SEAMCALL that takes an explicit HPA as an argument will walk the tree,
taking a shared lock on entries. This is required to make sure that the
page pointed to by the HPA is of a compatible type for the usage.

TDCALLs don't take PAMT locks as none of them take an HPA as an argument.

Dynamic PAMT enabling in kernel
===============================

The kernel maintains a refcount for every 2M region, with two helpers:
tdx_pamt_get() and tdx_pamt_put().

The refcount represents number of users for the PAMT memory in the region.
Kernel calls TDH.PHYMEM.PAMT.ADD on 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on transition 1->0.
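
A minimal user-space model of this refcounting rule, not the code from the
series. Everything prefixed with model_ is hypothetical, the SEAMCALL
wrappers are replaced with counters, and there is no locking, which a real
implementation would need around the 0->1 and 1->0 transitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REGION_SHIFT 21	/* 2M regions */
#define NR_REGIONS   8	/* toy model: 16 MiB of "physical memory" */

static unsigned int pamt_refcount[NR_REGIONS];
static bool pamt_present[NR_REGIONS];
static unsigned int nr_add_calls, nr_remove_calls;

/* Stand-in for the TDH.PHYMEM.PAMT.ADD wrapper. */
static void model_pamt_add(uint64_t region)
{
	pamt_present[region] = true;
	nr_add_calls++;
}

/* Stand-in for the TDH.PHYMEM.PAMT.REMOVE wrapper. */
static void model_pamt_remove(uint64_t region)
{
	pamt_present[region] = false;
	nr_remove_calls++;
}

/* Take a reference on the 2M region containing hpa; ADD on 0->1. */
static void model_pamt_get(uint64_t hpa)
{
	uint64_t region = hpa >> REGION_SHIFT;

	assert(region < NR_REGIONS);
	if (pamt_refcount[region]++ == 0)
		model_pamt_add(region);
}

/* Drop a reference on the 2M region containing hpa; REMOVE on 1->0. */
static void model_pamt_put(uint64_t hpa)
{
	uint64_t region = hpa >> REGION_SHIFT;

	assert(region < NR_REGIONS);
	assert(pamt_refcount[region] > 0);
	if (--pamt_refcount[region] == 0)
		model_pamt_remove(region);
}
```

Two 4K pages from the same 2M region cost a single ADD in this model, and
the REMOVE happens only when the last of them is put.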

PAMT memory gets allocated as part of TD init, VCPU init, on populating
SEPT tree and adding guest memory (both during TD build and via AUG on
accept).

PAMT memory is removed on reclaim of control pages and guest memory.

Populating PAMT memory on fault is tricky as we cannot allocate memory
from the context where it is needed. I introduced a pair of kvm_x86_ops to
allocate PAMT memory from a per-VCPU pool from a context where the VCPU is
still around and to free it on failure. This flow will likely be reworked in
the next versions.
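
The general shape of such a pool can be sketched in user space as follows.
This is only an illustration of the top-up-then-consume pattern under
assumed names (struct pamt_pool and the pool_* helpers are hypothetical),
with malloc() standing in for the page allocator; it is not the code from
the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 8
#define PAGE_BYTES    4096

struct pamt_pool {
	void *pages[POOL_CAPACITY];
	size_t nr;
};

/* May allocate: call before entering the context that cannot. 0 on success. */
static int pool_topup(struct pamt_pool *pool, size_t min)
{
	assert(min <= POOL_CAPACITY);
	while (pool->nr < min) {
		void *page = malloc(PAGE_BYTES);

		if (!page)
			return -1;
		pool->pages[pool->nr++] = page;
	}
	return 0;
}

/* Never allocates: take a preallocated page, or NULL if the pool is empty. */
static void *pool_take(struct pamt_pool *pool)
{
	return pool->nr ? pool->pages[--pool->nr] : NULL;
}

/* Return a page that ended up unused, e.g. because the operation failed. */
static void pool_give_back(struct pamt_pool *pool, void *page)
{
	assert(pool->nr < POOL_CAPACITY);
	pool->pages[pool->nr++] = page;
}

/* Free everything left in the pool. */
static void pool_drain(struct pamt_pool *pool)
{
	while (pool->nr)
		free(pool->pages[--pool->nr]);
}
```

The point of the split is that only pool_topup() can fail for lack of
memory, and it runs where sleeping and allocation are allowed.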

Previous attempt on Dynamic PAMT enabling
=========================================

My initial kernel enabling attempt was quite different. I wanted to make
PAMT allocation lazy: only try to add a PAMT page pair if a SEAMCALL fails
due to missing PAMT, and reclaim it based on the hint provided by the TDX
module.

The motivation was to avoid duplicating on the kernel side the PAMT memory
refcounting that the TDX module already does.

This approach is inherently more racy as we don't serialize PAMT memory
add/remove against SEAMCALLs that add/remove memory for a TD. Such
serialization would require global locking, which is a no-go.

I made this approach work, but at some point I realized that it cannot be
robust as long as we want to avoid TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will pop up as a result of the races I mentioned above.

I gave up on this approach and went with the current one which uses
explicit refcounting.


Brain dumped.

Let me know if anything is unclear.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Dave Hansen 8 months, 3 weeks ago
On 5/15/25 07:22, Kirill A. Shutemov wrote:
> VMM is responsible for allocating and freeing PAMT_4K. There's a pair of
> new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. They
> add/remove PAMT memory in form of page pair. There's no requirement for
> these pages to be contiguous.

BTW, that second sentence is a little goofy. Is it talking about
ADD/REMOVE being a matched pair? Or that there needs to be 8k of
metadata storage provided to each ADD/REMOVE call?

One thing I've noticed in writing changelogs and so forth is that
repetition can hurt understanding if the concepts aren't the same. Like
saying there is a "pair" of calls and a "pair" of pages when the fact
that both are pairs is a coincidence rather than an intentional and
important part of the design.
Re: [RFC, PATCH 00/12] TDX: Enable Dynamic PAMT
Posted by Kirill A. Shutemov 8 months, 3 weeks ago
On Thu, May 15, 2025 at 08:03:28AM -0700, Dave Hansen wrote:
> On 5/15/25 07:22, Kirill A. Shutemov wrote:
> > VMM is responsible for allocating and freeing PAMT_4K. There's a pair of
> > new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE. They
> > add/remove PAMT memory in form of page pair. There's no requirement for
> > these pages to be contiguous.
> 
> BTW, that second sentence is a little goofy. Is it talking about
> ADD/REMOVE being a matched pair? Or that there needs to be 8k of
> metadata storage provided to each ADD/REMOVE call?

Both :P

Pair of SEAMCALLs operate on pairs of pages.

> One thing I've noticed in writing changelogs and so forth is that
> repetition can hurt understanding if the concepts aren't the same. Like
> saying there is a "pair" of calls and a "pair" of pages when the fact
> that both are pairs is a coincidence rather than an intentional and
> important part of the design.

Yeah, I see it.

I will try to avoid "pair" for SEAMCALLs in Dynamic PAMT context.
Maybe it will clear up the confusion.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov