[PATCHv2 00/12] TDX: Enable Dynamic PAMT

Kirill A. Shutemov posted 12 patches 4 months ago
There is a newer version of this series
[PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Kirill A. Shutemov 4 months ago
This patchset enables Dynamic PAMT in TDX. Please review.

Previously, we thought it could be upstreamed after huge page support, but
huge pages require support on the guest_memfd side, which might take time
to hit upstream. Dynamic PAMT has no such dependency.

The patchset can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt

==========================================================================

The Physical Address Metadata Table (PAMT) holds TDX metadata for
physical memory and must be allocated by the kernel during TDX module
initialization.

The exact size of the required PAMT memory is determined by the TDX
module and may vary between TDX module versions, but currently it is
approximately 0.4% of the system memory. This is a significant
commitment, especially if it is not known upfront whether the machine
will run any TDX guests.

The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
PAMT_2M levels are still allocated on TDX module initialization, but the
PAMT_4K level is allocated dynamically, reducing static allocations to
approximately 0.004% of the system memory.
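
For example, on a host with 1 TB of memory this means roughly 4 GB of
static PAMT today versus roughly 40 MB with Dynamic PAMT, plus whatever
PAMT_4K memory is added on demand.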

PAMT memory is dynamically allocated as pages gain TDX protections.
It is reclaimed when TDX protections have been removed from all
pages in a contiguous area.

Dynamic PAMT support in TDX module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of
TDX module initialization. At the init stage, the PAMT_4K allocation is
replaced with PAMT_PAGE_BITMAP, which currently requires one bit of memory
per 4K page.
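
(One bit per 4K page works out to roughly 32 MB of bitmap per 1 TB of
memory.)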

The VMM is responsible for allocating and freeing PAMT_4K. There are a
couple of new SEAMCALLs for this: TDH.PHYMEM.PAMT.ADD and
TDH.PHYMEM.PAMT.REMOVE. They add/remove PAMT memory in the form of a page
pair. There is no requirement for these pages to be contiguous.

A page pair supplied via TDH.PHYMEM.PAMT.ADD covers the specified 2M
region. It allows any 4K page from the region to be used by the TDX
module.
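
In other words, all 4K pages that share the same 2M-aligned base are
served by the same PAMT page pair. As a rough illustration (this helper
is made up for the explanation, it is not part of the patchset):

  /* Which 2M PAMT region a host physical address falls into */
  static inline unsigned long pamt_2m_index(phys_addr_t hpa)
  {
          return hpa >> PMD_SHIFT;        /* PMD_SHIFT == 21, i.e. 2M */
  }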

With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):

 - TDH.MNG.CREATE
 - TDH.MNG.ADDCX
 - TDH.VP.ADDCX
 - TDH.VP.CREATE
 - TDH.MEM.PAGE.ADD
 - TDH.MEM.PAGE.AUG
 - TDH.MEM.PAGE.DEMOTE
 - TDH.MEM.PAGE.RELOCATE

Basically, if you supply memory to a TD, this memory has to be backed by
PAMT memory.

Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.

The TDX module tracks PAMT memory usage and can give the VMM a hint that
PAMT memory can be removed. Such a hint is provided by all SEAMCALLs that
remove memory from a TD:

 - TDH.MEM.SEPT.REMOVE
 - TDH.MEM.PAGE.REMOVE
 - TDH.MEM.PAGE.PROMOTE
 - TDH.MEM.PAGE.RELOCATE
 - TDH.PHYMEM.PAGE.RECLAIM

With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as
additional input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE
returns the PAMT page pair that is no longer needed.

PAMT memory is a global resource and is not tied to a specific TD. The
TDX module maintains PAMT memory in a radix tree addressed by physical
address. Each entry in the tree can be locked with a shared or exclusive
lock. Any modification of the tree requires an exclusive lock.

Any SEAMCALL that takes an explicit HPA as an argument walks the tree,
taking shared locks on the entries. This is required to make sure that
the page pointed to by the HPA is of a compatible type for the usage.

TDCALLs don't take PAMT locks as none of them take an HPA as an argument.

Dynamic PAMT enabling in kernel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel maintains a refcount for every 2M region with two helpers,
tdx_pamt_get() and tdx_pamt_put().

The refcount represents the number of users of the PAMT memory in the
region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
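
A simplified sketch of the get side (not the exact code from the
patches: pamt_refcount_for() and tdh_phymem_pamt_add() are placeholder
names, and allocation of the PAMT page pair itself is omitted):

  static int tdx_pamt_get(struct page *page)
  {
          phys_addr_t hpa = page_to_phys(page);
          atomic_t *refcount = pamt_refcount_for(hpa);
          int ret = 0;

          /* Fast path: PAMT for this 2M region is already populated */
          if (atomic_inc_not_zero(refcount))
                  return 0;

          spin_lock(&pamt_lock);
          if (!atomic_read(refcount))
                  ret = tdh_phymem_pamt_add(hpa);         /* 0 -> 1 */
          if (!ret)
                  atomic_inc(refcount);
          spin_unlock(&pamt_lock);

          return ret;
  }

tdx_pamt_put() is symmetric: it drops the refcount and issues
TDH.PHYMEM.PAMT.REMOVE under the same lock on the 1->0 transition.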

The function tdx_alloc_page() allocates a new page and ensures that it is
backed by PAMT memory. Pages allocated in this manner are ready to be used
for a TD. The function tdx_free_page() frees the page and releases the
PAMT memory for the 2M region if it is no longer needed.
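
Roughly (same caveats as above, the exact signatures are illustrative):

  struct page *tdx_alloc_page(void)
  {
          struct page *page = alloc_page(GFP_KERNEL);

          if (!page)
                  return NULL;

          /* Make sure the 2M region around the page has PAMT_4K backing */
          if (tdx_pamt_get(page)) {
                  __free_page(page);
                  return NULL;
          }

          return page;
  }

  void tdx_free_page(struct page *page)
  {
          if (!page)
                  return;

          tdx_pamt_put(page);     /* may trigger TDH.PHYMEM.PAMT.REMOVE */
          __free_page(page);
  }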

PAMT memory gets allocated as part of TD init and VCPU init, on
populating the SEPT tree, and on adding guest memory (both during TD
build and via AUG on accept). Splitting a 2M page into 4K pages also
requires PAMT memory.

PAMT memory is removed on reclaim of control pages and guest memory.

Populating PAMT memory on fault and on split is tricky as the kernel
cannot allocate memory in the context where it is needed. These code
paths use pre-allocated PAMT memory pools.
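
This follows the same pattern KVM already uses for fault-path
allocations: top up a per-VCPU cache from a sleepable context before
taking mmu_lock, and only consume from it in the fault path. A sketch,
assuming the cache is a struct kvm_mmu_memory_cache at
vcpu->arch.pamt_page_cache (the exact field name and hook points may
differ):

  /* Called from a context that may sleep, before taking mmu_lock */
  static int tdx_topup_pamt_cache(struct kvm_vcpu *vcpu, int min)
  {
          return kvm_mmu_topup_memory_cache(&vcpu->arch.pamt_page_cache, min);
  }

  /* Called from the fault path, where sleeping allocations are not allowed */
  static struct page *tdx_take_pamt_page(struct kvm_vcpu *vcpu)
  {
          void *va = kvm_mmu_memory_cache_alloc(&vcpu->arch.pamt_page_cache);

          return virt_to_page(va);
  }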

Previous attempt on Dynamic PAMT enabling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The initial attempt at kernel enabling was quite different. It was built
around lazy PAMT allocation: only trying to add a PAMT page pair if a
SEAMCALL fails due to a missing PAMT and reclaiming it based on hints
provided by the TDX module.

The motivation was to avoid duplicating the PAMT memory refcounting that
the TDX module does on the kernel side.

This approach is inherently more racy as there is no serialization of
PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
Such serialization would require global locking, which is not feasible.

This approach worked, but at some point it became clear that it could not
be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will occur as a result of the races mentioned above.

This approach was abandoned in favor of explicit refcounting.

v2:
 - Drop phys_prepare/cleanup. Use kvm_get_running_vcpu() to reach per-VCPU PAMT
   memory pool from TDX code instead.
 - Move code that allocates/frees PAMT out of KVM;
 - Allocate refcounts per-memblock, not per-TDMR;
 - Fix free_pamt_metadata() for machines without Dynamic PAMT;
 - Fix refcounting in tdx_pamt_put() error path;
 - Export functions where they are used;
 - Consolidate TDX error handling code;
 - Add documentation for Dynamic PAMT;
 - Mark /proc/meminfo patch [NOT-FOR-UPSTREAM];
Kirill A. Shutemov (12):
  x86/tdx: Consolidate TDX error handling
  x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
  x86/virt/tdx: Allocate reference counters for PAMT memory
  x86/virt/tdx: Add tdx_alloc/free_page() helpers
  KVM: TDX: Allocate PAMT memory in __tdx_td_init()
  KVM: TDX: Allocate PAMT memory in tdx_td_vcpu_init()
  KVM: TDX: Preallocate PAMT pages to be used in page fault path
  KVM: TDX: Handle PAMT allocation in fault path
  KVM: TDX: Reclaim PAMT memory
  [NOT-FOR-UPSTREAM] x86/virt/tdx: Account PAMT memory and print it in
    /proc/meminfo
  x86/virt/tdx: Enable Dynamic PAMT
  Documentation/x86: Add documentation for TDX's Dynamic PAMT

 Documentation/arch/x86/tdx.rst              | 108 ++++++
 arch/x86/coco/tdx/tdx.c                     |   6 +-
 arch/x86/include/asm/kvm_host.h             |   2 +
 arch/x86/include/asm/set_memory.h           |   3 +
 arch/x86/include/asm/tdx.h                  |  40 ++-
 arch/x86/include/asm/tdx_errno.h            |  96 +++++
 arch/x86/include/asm/tdx_global_metadata.h  |   1 +
 arch/x86/kvm/mmu/mmu.c                      |   7 +
 arch/x86/kvm/vmx/tdx.c                      | 102 ++++--
 arch/x86/kvm/vmx/tdx.h                      |   1 -
 arch/x86/kvm/vmx/tdx_errno.h                |  40 ---
 arch/x86/mm/Makefile                        |   2 +
 arch/x86/mm/meminfo.c                       |  11 +
 arch/x86/mm/pat/set_memory.c                |   2 +-
 arch/x86/virt/vmx/tdx/tdx.c                 | 380 +++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h                 |   5 +-
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c |   3 +
 virt/kvm/kvm_main.c                         |   1 +
 18 files changed, 702 insertions(+), 108 deletions(-)
 create mode 100644 arch/x86/include/asm/tdx_errno.h
 delete mode 100644 arch/x86/kvm/vmx/tdx_errno.h
 create mode 100644 arch/x86/mm/meminfo.c

-- 
2.47.2
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Dave Hansen 1 month ago
On 6/9/25 12:13, Kirill A. Shutemov wrote:
> The exact size of the required PAMT memory is determined by the TDX
> module and may vary between TDX module versions, but currently it is
> approximately 0.4% of the system memory. This is a significant
> commitment, especially if it is not known upfront whether the machine
> will run any TDX guests.
> 
> The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
> PAMT_2M levels are still allocated on TDX module initialization, but the
> PAMT_4K level is allocated dynamically, reducing static allocations to
> approximately 0.004% of the system memory.

I'm beginning to think that this series is barking up the wrong tree.

>  18 files changed, 702 insertions(+), 108 deletions(-)

I totally agree that saving 0.4% of memory on a gagillion systems saves
a gagillion dollars.

But this series seems to be marching down the path that the savings
needs to be down at page granularity: If a 2M page doesn't need PAMT
then it strictly shouldn't have any PAMT. While that's certainly a
high-utility tack, I can't help but think it may be overcomplicated.

What if we just focused on three states:

1. System boots, has no DPAMT.
2. First TD starts up, all DPAMT gets allocated
3. Last TD shuts down, all DPAMT gets freed

The cases that leaves behind are when the system has a small number of
TDs packed into a relatively small number of 2M pages. That occurs
either because they're backed with real huge pages or because they're
backed with 4k and nicely compacted because memory wasn't fragmented.

I know our uberscaler buddies are quite fond of those cases and want to
minimize memory use. But are you folks really going to have that many
systems which deploy a very small number of small TDs?

In other words, can we simplify this? Or at least _start_ simpler with v1?
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Kiryl Shutsemau 1 month ago
On Mon, Sep 08, 2025 at 12:17:58PM -0700, Dave Hansen wrote:
> On 6/9/25 12:13, Kirill A. Shutemov wrote:
> > The exact size of the required PAMT memory is determined by the TDX
> > module and may vary between TDX module versions, but currently it is
> > approximately 0.4% of the system memory. This is a significant
> > commitment, especially if it is not known upfront whether the machine
> > will run any TDX guests.
> > 
> > The Dynamic PAMT feature reduces static PAMT allocations. PAMT_1G and
> > PAMT_2M levels are still allocated on TDX module initialization, but the
> > PAMT_4K level is allocated dynamically, reducing static allocations to
> > approximately 0.004% of the system memory.
> 
> I'm beginning to think that this series is barking up the wrong tree.
> 
> >  18 files changed, 702 insertions(+), 108 deletions(-)
> 
> I totally agree that saving 0.4% of memory on a gagillion systems saves
> a gagillion dollars.
> 
> But this series seems to be marching down the path that the savings
> needs to be down at page granularity: If a 2M page doesn't need PAMT
> then it strictly shouldn't have any PAMT. While that's certainly a
> high-utiltiy tact, I can't help but think it may be over complicated.
> 
> What if we just focused on three states:
> 
> 1. System boots, has no DPAMT.
> 2. First TD starts up, all DPAMT gets allocated
> 3. Last TD shuts down, all DPAMT gets freed
> 
> The cases that leaves behind are when the system has a small number of
> TDs packed into a relatively small number of 2M pages. That occurs
> either because they're backing with real huge pages or that they are
> backed with 4k and nicely compacted because memory wasn't fragmented.
> 
> I know our uberscaler buddies are quite fond of those cases and want to
> minimize memory use. But are you folks really going to have that many
> systems which deploy a very small number of small TDs?
> 
> In other words, can we simplify this? Or at least _start_ simpler with v1?

You cannot give all DPAMT memory to the TDX module at the start of the
first TD and forget about it. Well, you can, if you give up ever mapping
guest memory with 2M pages.

Dynamic PAMT pages are stored in the PAMT_2M entry, and you cannot have a
2M page and dynamic 4K entries stored there at the same time.

You would need to handle this at least for promotion/demotion and stash
this memory somewhere while 2M pages are in use.

And it is going to be very wasteful. With huge pages, in most cases, you
only need dynamic PAMT for control pages. You will have a lot of memory
sitting in the stash with zero use.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Dave Hansen 1 month ago
On 9/9/25 04:16, Kiryl Shutsemau wrote:
> Dynamic PAMT pages are stored into PAMT_2M entry and you cannot have 2M
> page and have Dynamic 4K entries stored there at the same time.

That sounds like a TDX module implementation bug to me.

Worst possible case, the TDX module could double the
'sysinfo_tdmr->pamt_2m_entry_size' and use the other half for more pointers.

> And it is going to be very wasteful. With huge pages, in most cases, you
> only need dynamic PAMT for control pages. You will have a lot of memory
> sitting in stash with zero use.

I think it's going to be hard to convince me without actual data on this
one.

Even then, we're talking about 0.4% of system memory. So how much code
and complexity are we talking about in order to save a *maximum* of 0.4%
of system memory?
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Vishal Annapurve 1 month ago
On Tue, Sep 9, 2025 at 8:24 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/9/25 04:16, Kiryl Shutsemau wrote:
>
> > And it is going to be very wasteful. With huge pages, in most cases, you
> > only need dynamic PAMT for control pages. You will have a lot of memory
> > sitting in stash with zero use.
>
> I think it's going to be hard to convince me without actual data on this
> one.
>

* With 1G page backing and with DPAMT entries created only for 4K EPT mappings
- ~5MB of DPAMT memory usage for 704G guest memory size. We expect the
DPAMT memory usage to be in MBs even with 4096G guest memory size.

* With DPAMT entries created for all private memory irrespective of
mapping granularity
- DPAMT memory usage is around ~3GB for 704G guest memory size and
around ~16G for 4096G guest memory size.

For a 4TB guest memory size with 1G page backing, the DPAMT memory usage

> Even then, we're talking about 0.4% of system memory. So how much code
> and complexity are we talking about in order to save a *maximum* of 0.4%
> of system memory?
>
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 2 months ago
On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> Dynamic PAMT enabling in kernel
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Kernel maintains refcounts for every 2M regions with two helpers
> tdx_pamt_get() and tdx_pamt_put().
> 
> The refcount represents number of users for the PAMT memory in the region.
> Kernel calls TDH.PHYMEM.PAMT.ADD on 0->1 transition and
> TDH.PHYMEM.PAMT.REMOVE on transition 1->0.
> 
> The function tdx_alloc_page() allocates a new page and ensures that it is
> backed by PAMT memory. Pages allocated in this manner are ready to be used
> for a TD. The function tdx_free_page() frees the page and releases the
> PAMT memory for the 2M region if it is no longer needed.
> 
> PAMT memory gets allocated as part of TD init, VCPU init, on populating
> SEPT tree and adding guest memory (both during TD build and via AUG on
> accept). Splitting 2M page into 4K also requires PAMT memory.
> 
> PAMT memory removed on reclaim of control pages and guest memory.
> 
> Populating PAMT memory on fault and on split is tricky as kernel cannot
> allocate memory from the context where it is needed. These code paths use
> pre-allocated PAMT memory pools.
> 
> Previous attempt on Dynamic PAMT enabling
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The initial attempt at kernel enabling was quite different. It was built
> around lazy PAMT allocation: only trying to add a PAMT page pair if a
> SEAMCALL fails due to a missing PAMT and reclaiming it based on hints
> provided by the TDX module.
> 
> The motivation was to avoid duplicating the PAMT memory refcounting that
> the TDX module does on the kernel side.
> 
> This approach is inherently more racy as there is no serialization of
> PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
> Such serialization would require global locking, which is not feasible.
> 
> This approach worked, but at some point it became clear that it could not
> be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
> TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
> 
> This approach was abandoned in favor of explicit refcounting.

On closer inspection this new solution also has global locking. It
opportunistically checks to see if there is already a refcount, but otherwise
when a PAMT page actually has to be added there is a global spinlock while the
PAMT add/remove SEAMCALL is made. I guess this is going to get taken somewhere
around once per 512 4k private pages, but when it does it has some less ideal
properties:
 - Cache line bouncing of the lock between all TDs on the host
 - A global exclusive lock deep inside the TDP MMU shared lock fault path
 - Heavy contention when two TDs are shutting down at the same time?

As for why not only do the lock as a backup option like the kick+lock solution
in KVM, the problem would be losing the refcount race and ending up with a PAMT
page getting released early.

As far as TDX module locking is concerned (i.e. BUSY error codes from pamt
add/remove), it seems this would only happen if pamt add/remove operate
simultaneously on the same 2MB HPA region. That is completely prevented by the
refcount and global lock, but it's a bit heavyweight. It prevents simultaneously
adding totally separate 2MB regions when we only would need to prevent
simultaneously operating on the same 2MB region.

I don't see any other reason for the global spin lock. Kirill, was that it? Did
you consider also adding a lock per 2MB region, like the refcount? Or any other
granularity of lock besides global? Not saying global is definitely the wrong
choice, but seems arbitrary if I got the above right.

Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by kas@kernel.org 2 months ago
On Fri, Aug 08, 2025 at 11:18:40PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> > Dynamic PAMT enabling in kernel
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > Kernel maintains refcounts for every 2M regions with two helpers
> > tdx_pamt_get() and tdx_pamt_put().
> > 
> > The refcount represents number of users for the PAMT memory in the region.
> > Kernel calls TDH.PHYMEM.PAMT.ADD on 0->1 transition and
> > TDH.PHYMEM.PAMT.REMOVE on transition 1->0.
> > 
> > The function tdx_alloc_page() allocates a new page and ensures that it is
> > backed by PAMT memory. Pages allocated in this manner are ready to be used
> > for a TD. The function tdx_free_page() frees the page and releases the
> > PAMT memory for the 2M region if it is no longer needed.
> > 
> > PAMT memory gets allocated as part of TD init, VCPU init, on populating
> > SEPT tree and adding guest memory (both during TD build and via AUG on
> > accept). Splitting 2M page into 4K also requires PAMT memory.
> > 
> > PAMT memory removed on reclaim of control pages and guest memory.
> > 
> > Populating PAMT memory on fault and on split is tricky as kernel cannot
> > allocate memory from the context where it is needed. These code paths use
> > pre-allocated PAMT memory pools.
> > 
> > Previous attempt on Dynamic PAMT enabling
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > The initial attempt at kernel enabling was quite different. It was built
> > around lazy PAMT allocation: only trying to add a PAMT page pair if a
> > SEAMCALL fails due to a missing PAMT and reclaiming it based on hints
> > provided by the TDX module.
> > 
> > The motivation was to avoid duplicating the PAMT memory refcounting that
> > the TDX module does on the kernel side.
> > 
> > This approach is inherently more racy as there is no serialization of
> > PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
> > Such serialization would require global locking, which is not feasible.
> > 
> > This approach worked, but at some point it became clear that it could not
> > be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
> > TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
> > 
> > This approach was abandoned in favor of explicit refcounting.
> 
> On closer inspection this new solution also has global locking. It
> opportunistically checks to see if there is already a refcount, but otherwise
> when a PAMT page actually has to be added there is a global spinlock while the
> PAMT add/remove SEAMCALL is made. I guess this is going to get taken somewhere
> around once per 512 4k private pages, but when it does it has some less ideal
> properties:
>  - Cache line bouncing of the lock between all TDs on the host
>  - An global exclusive lock deep inside the TDP MMU shared lock fault path
>  - Contend heavily when two TD's shutting down at the same time?
> 
> As for why not only do the lock as a backup option like the kick+lock solution
> in KVM, the problem would be losing the refcount race and ending up with a PAMT
> page getting released early.
> 
> As far as TDX module locking is concerned (i.e. BUSY error codes from pamt
> add/remove), it seems this would only happen if pamt add/remove operate
> simultaneously on the same 2MB HPA region. That is completely prevented by the
> refcount and global lock, but it's a bit heavyweight. It prevents simultaneously
> adding totally separate 2MB regions when we only would need to prevent
> simultaneously operating on the same 2MB region.
> 
> I don't see any other reason for the global spin lock, Kirill was that it? Did
> you consider also adding a lock per 2MB region, like the refcount? Or any other
> granularity of lock besides global? Not saying global is definitely the wrong
> choice, but seems arbitrary if I got the above right.

We have discussed this before[1]. Global locking is problematic when you
actually hit contention. Let's not complicate things until we actually
see it. I failed to demonstrate contention without huge pages. With huge
pages it is even more dubious that we ever see it.

[1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/

-- 
Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 2 months ago
On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote:
> > I don't see any other reason for the global spin lock, Kirill was that it?
> > Did
> > you consider also adding a lock per 2MB region, like the refcount? Or any
> > other
> > granularity of lock besides global? Not saying global is definitely the
> > wrong
> > choice, but seems arbitrary if I got the above right.
> 
> We have discussed this before[1]. Global locking is problematic when you
> actually hit contention. Let's not complicate things until we actually
> see it. I failed to demonstrate contention without huge pages. With huge
> pages it is even more dubious that we ever see it.
> 
> [1]
> https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/

Ah, I see.

I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge
pages) and then shutting them down. I saw 701 contentions on startup, and 53
more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
But it probably wasn't much of a cacheline bouncing worst case. And I guess this
is on my latest changes not this exact v2, but it shouldn't have changed.

But hmm, it seems Dave's objection about maintaining the lock allocations would
apply to the refcounts too? But the hotplug concerns shouldn't actually be an
issue for TDX because hotplug gets rejected if the allocations are not already
there. So complexity of a per-2MB lock should be minimal, at least
incrementally. The difference seems more about memory use vs performance.

What gives me pause is in the KVM TDX work we have really tried hard to not take
exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
hard numbers. But an enormous amount of work went into lettings KVM faults
happen under the shared lock for normal VMs. So on one hand, yes it's premature
optimization. But on the other hand, it's a maintainability concern about
polluting the existing way things work in KVM with special TDX properties.

I think we need to at least call out loudly that the decision was to go with the
simplest possible solution, and the impact to KVM. I'm not sure what Sean's
opinion is, but I wouldn't want him to first learn of it when he went digging
and found a buried global spin lock in the fault path.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Sean Christopherson 2 months ago
On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote:
> > > I don't see any other reason for the global spin lock, Kirill was that
> > > it?  Did you consider also adding a lock per 2MB region, like the
> > > refcount? Or any other granularity of lock besides global? Not saying
> > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > above right.
> > 
> > We have discussed this before[1]. Global locking is problematic when you
> > actually hit contention. Let's not complicate things until we actually
> > see it. I failed to demonstrate contention without huge pages. With huge
> > pages it is even more dubious that we ever see it.
> > 
> > [1]
> > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
> 
> Ah, I see.
> 
> I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge

How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?

There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
private<=>shared conversions and thus interferes with PAMT allocations for other
VMs.

> pages) and then shutting them down. I saw 701 contentions on startup, and 53
> more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> But it probably wasn't much of a cacheline bouncing worse case.

Isn't the SEAMCALL done while holding the spinlock?  I assume the latency of the
SEAMCALL is easily the long pole in the flow.

> And I guess this is on my latest changes not this exact v2, but it shouldn't
> have changed.
> 
> But hmm, it seems Dave's objection about maintaining the lock allocations would
> apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> issue for TDX because they gets rejected if the allocations are not already
> there. So complexity of a per-2MB lock should be minimal, at least
> incrementally. The difference seems more about memory use vs performance.
> 
> What gives me pause is in the KVM TDX work we have really tried hard to not take
> exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> hard numbers.

Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
for write is problematic.  Even if TDX VMs don't exhibit the same patterns *today*
as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
that will always hold true.
 
> But an enormous amount of work went into lettings KVM faults happen under the
> shared lock for normal VMs. So on one hand, yes it's premature optimization.
> But on the other hand, it's a maintainability concern about polluting the
> existing way things work in KVM with special TDX properties.
> 
> I think we need to at least call out loudly that the decision was to go with the
> simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> opinion is, but I wouldn't want him to first learn of it when he went digging
> and found a buried global spin lock in the fault path.

Heh, too late, I saw it when this was first posted.  And to be honest, my initial
reaction was very much "absolutely not" (though Rated R, not PG).  Now that I've
had time to think things through, I'm not _totally_ opposed to having a spinlock
in the page fault path, but my overall sentiment remains the same.

For mmu_lock and related SPTE operations, I was super adamant about not taking
exclusive locks because based on our experience with the TDP MMU, converting flows
from exclusive to shared is usually significantly more work than developing code
for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
And importantly, those code paths were largely solved problems.  I.e. I didn't
want to get into a situation where TDX undid the parallelization of the TDP MMU,
and then had to add it back after the fact.

I think the same holds true here.  I'm not completely opposed to introducing a
spinlock, but I want to either have a very high level of confidence that the lock
won't introduce jitter/delay (I have low confidence on this front, at least in
the proposed patches), or have super clear line of sight to making the contention
irrelevant, without having to rip apart the code.

My biggest question at this point is: why is all of this being done on-demand?
IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K
pages at the last possible moment".  Neither of those seems ideal.

E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand
PAMT management seems reasonable.  But for PAMTs that are used to track guest-assigned
memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd?
I.e. setup PAMT crud when guest_memfd is populated, not when the memory is mapped
into the guest.  That way, setups that care about guest boot time can preallocate
guest_memfd in order to get the PAMT stuff out of the way.

You could do the same thing by prefaulting guest memory, but TDX has limitations
there, and I see very little value in precisely reclaiming PAMT memory when a
leaf S-EPT is zapped, i.e. when a page is converted from private=>shared.  As
above, that's just asking for noisy neighbor issues.

The complaints with static PAMT are that it required burning 0.4% of memory even
if the host isn't actively running TDX VMs.  Burning 0.4% of the memory assigned
to a guest, regardless of whether it's map private or shared, seems acceptable,
and I think would give us a lot more flexibility in avoiding locking issues.

Similarly, we could bind a PAMT to non-leaf S-EPT pages during mmu_topup_memory_caches(),
i.e. when arch.mmu_external_spt_cache is filled.  Then there would be no need for
a separate vcpu->arch.pamt_page_cache, and more work would be done outside of
mmu_lock.  Freeing SPTs would still be done under mmu_lock (I think), but that
should be a much rarer operation.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 4 weeks ago
On Mon, 2025-08-11 at 19:02 -0700, Sean Christopherson wrote:
> > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non
> > huge
> 
> How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?

4 vCPUs.

I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory
from 4 threads until OOM, then shutdown. TDs were split between two sockets. It
ended up with 1136 contentions of the global lock, 4ms waiting.

It still feels very wrong, but also not something that is very measurable in
the real world. Like Kirill was saying, the global lock is not held very long.
I'm not sure if this may still hit scalability issues from cacheline bouncing on
bigger multi-socket systems. But we do have a path forward if we hit this.
Depending on the scale of the problem that comes up, we could decide whether to
do the lock per-2MB region with more memory usage, or a hashed table of N locks
like Dave suggested.

So I'll plan to keep the existing single lock for now unless anyone has any
strong objections.

> 
> There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
> private<=>shared conversions and thus interferes with PAMT allocations for
> other
> VMs.

Hmm, as long as it doesn't block it completely, it seems ok?
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Dave Hansen 1 month, 4 weeks ago
On 8/13/25 15:43, Edgecombe, Rick P wrote:
> I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory
> from 4 threads until OOM, then shutdown. TDs were split between two sockets. It
> ended up with 1136 contentions of the global lock, 4ms waiting.

4ms out of how much CPU time?

Also, contention is *NOT* necessarily bad here. Only _false_ contention.

The whole point of the lock is to ensure that there aren't two different
CPUs trying to do two different things to the same PAMT range at the
same time.

If there are, one of them *HAS* to wait. It can wait lots of different
ways, but it has to wait. That wait will show up as spinlock contention.

Even if the global lock went away, that 4ms of spinning might still be
there.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 4 weeks ago
On Wed, 2025-08-13 at 16:31 -0700, Dave Hansen wrote:
> On 8/13/25 15:43, Edgecombe, Rick P wrote:
> > I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory
> > from 4 threads until OOM, then shutdown. TDs were split between two sockets. It
> > ended up with 1136 contentions of the global lock, 4ms waiting.
> 
> 4ms out of how much CPU time?

The whole test took about 60s wall time (minus the time of some manual steps).
I'll have to automate it a bit more. But 4ms seemed safely in the "small"
category.

> 
> Also, contention is *NOT* necessarily bad here. Only _false_ contention.
> 
> The whole point of the lock is to ensure that there aren't two different
> CPUs trying to do two different things to the same PAMT range at the
> same time.
> 
> If there are, one of them *HAS* to wait. It can wait lots of different
> ways, but it has to wait. That wait will show up as spinlock contention.
> 
> Even if the global lock went away, that 4ms of spinning might still be
> there.

I assumed it was mostly real contention because the refcount check outside
the lock should prevent the majority of "two threads operating on the same 2MB
region" collisions. The code is roughly:

1:
   if (atomic_inc_not_zero(2mb_pamt_refcount))
	return <it's mapped>;
2:
   <global lock>
   if (atomic_read(2mb_pamt_refcount) != 0) {
3:
	atomic_inc(2mb_pamt_refcount);
	<global unlock>
	return <it's mapped>;
   }
   <seamcall>
   <global unlock>
4:

(similar pattern on the unmapping)

So it will only be valid contention if two threads try to fault in the *same* 2MB
DPAMT region *and* lose that race around 1-3, but invalid contention if threads try
to execute 2-4 at the same time for any different 2MB regions.

Let me go verify.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Kiryl Shutsemau 1 month, 4 weeks ago
On Thu, Aug 14, 2025 at 12:14:40AM +0000, Edgecombe, Rick P wrote:
> On Wed, 2025-08-13 at 16:31 -0700, Dave Hansen wrote:
> > On 8/13/25 15:43, Edgecombe, Rick P wrote:
> > > I redid the test. Boot 10 TDs with 16GB of ram, run userspace to fault in memory
> > > from 4 threads until OOM, then shutdown. TDs were split between two sockets. It
> > > ended up with 1136 contentions of the global lock, 4ms waiting.
> > 
> > 4ms out of how much CPU time?
> 
> The whole test took about 60s wall time (minus the time of some manual steps).
> I'll have to automate it a bit more. But 4ms seemed safely in the "small"
> category.
> 
> > 
> > Also, contention is *NOT* necessarily bad here. Only _false_ contention.
> > 
> > The whole point of the lock is to ensure that there aren't two different
> > CPUs trying to do two different things to the same PAMT range at the
> > same time.
> > 
> > If there are, one of them *HAS* to wait. It can wait lots of different
> > ways, but it has to wait. That wait will show up as spinlock contention.
> > 
> > Even if the global lock went away, that 4ms of spinning might still be
> > there.
> 
> I assumed it was mostly real contention because the the refcount check outside
> the lock should prevent the majority of "two threads operating on the same 2MB
> region" collisions. The code is roughly:
> 
> 1:
>    if (atomic_inc_not_zero(2mb_pamt_refcount))
> 	return <it's mapped>;
> 2:
>    <global lock>
>    if (atomic_read(2mb_pamt_refcount) != 0) {
> 3:
> 	atomic_inc(2mb_pamt_refcount);
> 	<global unlock>
> 	return <it's mapped>;
>    }
>    <seamcall>
>    <global unlock>
> 4:
> 
> (similar pattern on the unmapping)
> 
> So it will only be valid contention if two threads try to fault in the *same* 2MB
> DPAMT region *and* lose that race around 1-3, but invalid contention if threads try
> to execute 2-4 at the same time for any different 2MB regions.
> 
> Let me go verify.

Note that in the absence of the global lock here, concurrent PAMT.ADD
would also trigger some cache bouncing during pamt_walk(), on taking a
shared lock on the 1G PAMT entry and exclusive locks on the 2M entries in
the same cache line (4 PAMT_2M entries per cache line). This is hidden by
the global lock.

You would not recover full contention time by removing the global lock.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 3 weeks ago
On Thu, 2025-08-14 at 11:55 +0100, Kiryl Shutsemau wrote:
> > > > (similar pattern on the unmapping)
> > > > 
> > > > So it will only be valid contention if two threads try to fault in the
> > > > *same* 2MB DPAMT region *and* lose that race around 1-3, but invalid
> > > > contention if threads try to execute 2-4 at the same time for any
> > > > different 2MB regions.
> > > > 
> > > > Let me go verify.

It lost the race only once over a couple runs. So it seems mostly invalid
contention.

> > 
> > Note that in absence of the global lock here, concurrent PAMT.ADD would
> > also trigger some cache bouncing during pamt_walk() on taking shared
> > lock on 1G PAMT entry and exclusive lock on 2M entries in the same
> > cache (4 PAMT_2M entries per cache line). This is hidden by the global
> > lock.
> > 
> > You would not recover full contention time by removing the global lock.

Hmm, yea. Another consideration is that performance sensitive users will
probably be using huge pages, in which case 4k PAMT will be mostly skipped.

But man, the number and complexity of the locks are getting a bit high across
the whole stack. I don't have any easy ideas.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Sean Christopherson 1 month, 3 weeks ago
On Fri, Aug 15, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-14 at 11:55 +0100, Kiryl Shutsemau wrote:
> > > > > (similar pattern on the unmapping)
> > > > > 
> > > > > So it will only be valid contention if two threads try to fault in the
> > > > > *same* 2MB DPAMT region *and* lose that race around 1-3, but invalid
> > > > > contention if threads try to execute 2-4 at the same time for any
> > > > > different 2MB regions.
> > > > > 
> > > > > Let me go verify.
> 
> It lost the race only once over a couple runs. So it seems mostly invalid
> contention.
> 
> > > 
> > > Note that in absence of the global lock here, concurrent PAMT.ADD would
> > > also trigger some cache bouncing during pamt_walk() on taking shared
> > > lock on 1G PAMT entry and exclusive lock on 2M entries in the same
> > > cache (4 PAMT_2M entries per cache line). This is hidden by the global
> > > lock.
> > > 
> > > You would not recover full contention time by removing the global lock.
> 
> Hmm, yea. Another consideration is that performance sensitive users will
> probably be using huge pages, in which case 4k PAMT will be mostly skipped.
> 
> But man, the number and complexity of the locks is getting a bit high across the
> whole stack. I don't have any easy ideas.

FWIW, I'm not concerned about bouncing cachelines, I'm concerned about the cost
of the SEAMCALLs.  The latency due to bouncing a cache line due to "false"
contention is probably in the noise compared to waiting thousands of cycles for
other SEAMCALLs to complete.

That's also my concern with tying PAMT management to S-EPT population.  E.g. if
a use case triggers a decent amount S-EPT churn, then dynamic PAMT support will
exacerbate the S-EPT overhead.

But IIUC, that's a limitation of the TDX-Module design, i.e. there's no way to
hand it a pool of PAMT pages to manage.  And I suppose if a use case is churning
S-EPT, then it's probably going to be sad no matter what.  So, as long as the KVM
side of things isn't completely awful, I can live with on-demand PAMT management.

As for the global lock, I don't really care what we go with for initial support,
just so long as there's clear line of sight to an elegant solution _if_ we need
shard the lock.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 3 weeks ago
On Wed, 2025-08-20 at 08:31 -0700, Sean Christopherson wrote:
> > But man, the number and complexity of the locks is getting a bit high across
> > the whole stack. I don't have any easy ideas.
> 
> FWIW, I'm not concerned about bouncing cachelines, I'm concerned about the
> cost of the SEAMCALLs.  The latency due to bouncing a cache line due to
> "false" contention is probably in the noise compared to waiting thousands of
> cycles for other SEAMCALLs to complete.
> 
> That's also my concern with tying PAMT management to S-EPT population.  E.g.
> if a use case triggers a decent amount S-EPT churn, then dynamic PAMT support
> will exacerbate the S-EPT overhead.

I confirmed that matching the page size is currently required. Having it work with
mismatched page sizes was considered, but assessed to require more memory use.
As in more pages needed per 2MB region, not just more memory usage due to the
pre-allocation of all memory. We can do it if we prefer the simplicity over
memory usage.

> 
> But IIUC, that's a limitation of the TDX-Module design, i.e. there's no way to
> hand it a pool of PAMT pages to manage.  And I suppose if a use case is
> churning S-EPT, then it's probably going to be sad no matter what.  So, as
> long as the KVM side of things isn't completely awful, I can live with on-
> demand PAMT management.
> 
> As for the global lock, I don't really care what we go with for initial
> support, just so long as there's clear line of sight to an elegant solution
> _if_ we need shard the lock.

Ok, I'll leave it and we can look at whether the KVM side is simple enough.
Thanks for circling back.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by kas@kernel.org 2 months ago
On Mon, Aug 11, 2025 at 07:02:10PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote:
> > > > I don't see any other reason for the global spin lock, Kirill was that
> > > > it?  Did you consider also adding a lock per 2MB region, like the
> > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > > above right.
> > > 
> > > We have discussed this before[1]. Global locking is problematic when you
> > > actually hit contention. Let's not complicate things until we actually
> > > see it. I failed to demonstrate contention without huge pages. With huge
> > > pages it is even more dubious that we ever see it.
> > > 
> > > [1]
> > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
> > 
> > Ah, I see.
> > 
> > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge
> 
> How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?
> 
> There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
> private<=>shared conversions and thus interferes with PAMT allocations for other
> VMs.
> 
> > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> > But it probably wasn't much of a cacheline bouncing worse case.
> 
> Isn't the SEAMCALL done while holding the spinlock?  I assume the latency of the
> SEAMCALL is easily the long pole in the flow.
> 
> > And I guess this is on my latest changes not this exact v2, but it shouldn't
> > have changed.
> > 
> > But hmm, it seems Dave's objection about maintaining the lock allocations would
> > apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> > issue for TDX because they gets rejected if the allocations are not already
> > there. So complexity of a per-2MB lock should be minimal, at least
> > incrementally. The difference seems more about memory use vs performance.

I don't see a jump to per-2MB locking as remotely justified. We can scale
the number of locks gradually with the amount of memory in the system:
have a power-of-2 set of locks and map a 2MB range to a lock with %.
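
Something along these lines (a sketch; the names and the number of locks
are made up):

  #define NR_PAMT_LOCKS	256	/* power of 2, scaled with memory size */

  /* Initialized with spin_lock_init() during TDX module init */
  static spinlock_t pamt_locks[NR_PAMT_LOCKS];

  static spinlock_t *pamt_lock_for(phys_addr_t hpa)
  {
          /* Map the 2M region index onto the lock array */
          return &pamt_locks[(hpa >> PMD_SHIFT) % NR_PAMT_LOCKS];
  }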

Note that this is a trivial thing to add later on and doesn't need to be
part of the initial design.

> > What gives me pause is in the KVM TDX work we have really tried hard to not take
> > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> > hard numbers.
> 
> Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
> for write is problematic.  Even if TDX VMs don't exhibit the same patterns *today*
> as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
> that will always hold true.
>  
> > But an enormous amount of work went into lettings KVM faults happen under the
> > shared lock for normal VMs. So on one hand, yes it's premature optimization.
> > But on the other hand, it's a maintainability concern about polluting the
> > existing way things work in KVM with special TDX properties.
> > 
> > I think we need to at least call out loudly that the decision was to go with the
> > simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> > opinion is, but I wouldn't want him to first learn of it when he went digging
> > and found a buried global spin lock in the fault path.
> 
> Heh, too late, I saw it when this was first posted.  And to be honest, my initial
> reaction was very much "absolutely not" (though Rated R, not PG).  Now that I've
> had time to think things through, I'm not _totally_ opposed to having a spinlock
> in the page fault path, but my overall sentiment remains the same.
> 
> For mmu_lock and related SPTE operations, I was super adamant about not taking
> exclusive locks because based on our experience with the TDP MMU, converting flows
> from exclusive to shared is usually significantly more work than developing code
> for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> And importantly, those code paths were largely solved problems.  I.e. I didn't
> want to get into a situation where TDX undid the parallelization of the TDP MMU,
> and then had to add it back after the fact.
> 
> I think the same holds true here.  I'm not completely opposed to introducing a
> spinlock, but I want to either have a very high level of confidence that the lock
> won't introduce jitter/delay (I have low confidence on this front, at least in
> the proposed patches), or have super clear line of sight to making the contention
> irrelevant, without having to rip apart the code.

I think there is a big difference with mmu_lock.

mmu_lock is analogous to mmap_lock in core-mm. It serializes page faults
against other MMU operations and has an inherently vast scope.

pamt_lock, on the other hand, is at the very bottom of the call chain and
has a very limited scope. It is trivially scalable by partitioning.

Translating problems you see with mmu_lock onto pamt_lock seems like an
overreaction.

-- 
Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Vishal Annapurve 2 months ago
On Mon, Aug 11, 2025 at 7:02 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote:
> > > > I don't see any other reason for the global spin lock, Kirill was that
> > > > it?  Did you consider also adding a lock per 2MB region, like the
> > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > > above right.
> > >
> > > We have discussed this before[1]. Global locking is problematic when you
> > > actually hit contention. Let's not complicate things until we actually
> > > see it. I failed to demonstrate contention without huge pages. With huge
> > > pages it is even more dubious that we ever see it.
> > >
> > > [1]
> > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
> >
> > Ah, I see.
> >
> > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge
>
> How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?
>
> There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
> private<=>shared conversions and thus interferes with PAMT allocations for other
> VMs.
>
> > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> > But it probably wasn't much of a cacheline bouncing worse case.
>
> Isn't the SEAMCALL done while holding the spinlock?  I assume the latency of the
> SEAMCALL is easily the long pole in the flow.
>
> > And I guess this is on my latest changes not this exact v2, but it shouldn't
> > have changed.
> >
> > But hmm, it seems Dave's objection about maintaining the lock allocations would
> > apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> > issue for TDX because they gets rejected if the allocations are not already
> > there. So complexity of a per-2MB lock should be minimal, at least
> > incrementally. The difference seems more about memory use vs performance.
> >
> > What gives me pause is in the KVM TDX work we have really tried hard to not take
> > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> > hard numbers.
>
> Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
> for write is problematic.  Even if TDX VMs don't exhibit the same patterns *today*
> as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
> that will always hold true.
>
> > But an enormous amount of work went into lettings KVM faults happen under the
> > shared lock for normal VMs. So on one hand, yes it's premature optimization.
> > But on the other hand, it's a maintainability concern about polluting the
> > existing way things work in KVM with special TDX properties.
> >
> > I think we need to at least call out loudly that the decision was to go with the
> > simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> > opinion is, but I wouldn't want him to first learn of it when he went digging
> > and found a buried global spin lock in the fault path.
>
> Heh, too late, I saw it when this was first posted.  And to be honest, my initial
> reaction was very much "absolutely not" (though Rated R, not PG).  Now that I've
> had time to think things through, I'm not _totally_ opposed to having a spinlock
> in the page fault path, but my overall sentiment remains the same.
>
> For mmu_lock and related SPTE operations, I was super adamant about not taking
> exclusive locks because based on our experience with the TDP MMU, converting flows
> from exclusive to shared is usually significantly more work than developing code
> for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> And importantly, those code paths were largely solved problems.  I.e. I didn't
> want to get into a situation where TDX undid the parallelization of the TDP MMU,
> and then had to add it back after the fact.
>
> I think the same holds true here.  I'm not completely opposed to introducing a
> spinlock, but I want to either have a very high level of confidence that the lock
> won't introduce jitter/delay (I have low confidence on this front, at least in
> the proposed patches), or have super clear line of sight to making the contention
> irrelevant, without having to rip apart the code.
>
> My biggest question at this point is: why is all of this being done on-demand?
> IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K
> pages at the last possible moment".  Neither of those seems ideal.
>
> E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand
> PAMT management seems reasonable.  But for PAMTs that are used to track guest-assigned
> memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd?

This seems fine for 4K page backing. But when TDX VMs have huge page
backing, the vast majority of private memory wouldn't need PAMT
allocation at 4K granularity.

IIUC guest_memfd allocation happening at 2M granularity doesn't
necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
support is to be properly utilized for huge page backing, there is
value in not tying PAMT allocation to guest_memfd allocation.

> I.e. setup PAMT crud when guest_memfd is populated, not when the memory is mapped
> into the guest.  That way setups that cares about guest boot time can preallocate
> guest_memfd in order to get the PAMT stuff out of the way.
>
> You could do the same thing by prefaulting guest memory, but TDX has limitations
> there, and I see very little value in precisely reclaiming PAMT memory when a
> leaf S-EPT is zapped, i.e. when a page is converted from private=>shared.  As
> above, that's just asking for noisy neighbor issues.
>
> The complaints with static PAMT are that it required burning 0.4% of memory even
> if the host isn't actively running TDX VMs.  Burning 0.4% of the memory assigned
> to a guest, regardless of whether it's map private or shared, seems acceptable,
> and I think would give us a lot more flexibility in avoiding locking issues.
>
> Similarly, we could bind a PAMT to non-leaf S-EPT pages during mmu_topup_memory_caches(),
> i.e. when arch.mmu_external_spt_cache is filled.  Then there would be no need for
> a separate vcpu->arch.pamt_page_cache, and more work would be done outside of
> mmu_lock.  Freeing SPTs would still be done under mmu_lock (I think), but that
> should be a much rarer operation.
>
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by kas@kernel.org 2 months ago
On Mon, Aug 11, 2025 at 07:31:26PM -0700, Vishal Annapurve wrote:
> On Mon, Aug 11, 2025 at 7:02 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > > On Mon, 2025-08-11 at 07:31 +0100, kas@kernel.org wrote:
> > > > > I don't see any other reason for the global spin lock, Kirill was that
> > > > > it?  Did you consider also adding a lock per 2MB region, like the
> > > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > > > above right.
> > > >
> > > > We have discussed this before[1]. Global locking is problematic when you
> > > > actually hit contention. Let's not complicate things until we actually
> > > > see it. I failed to demonstrate contention without huge pages. With huge
> > > > pages it is even more dubious that we ever see it.
> > > >
> > > > [1]
> > > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@intel.com/
> > >
> > > Ah, I see.
> > >
> > > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge
> >
> > How many vCPUs?  And were the VMs actually accepting/faulting all 16GiB?
> >
> > There's also a noisy neighbor problem lurking.  E.g. malicious/buggy VM spams
> > private<=>shared conversions and thus interferes with PAMT allocations for other
> > VMs.
> >
> > > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> > > But it probably wasn't much of a cacheline-bouncing worst case.
> >
> > Isn't the SEAMCALL done while holding the spinlock?  I assume the latency of the
> > SEAMCALL is easily the long pole in the flow.
> >
> > > And I guess this is on my latest changes not this exact v2, but it shouldn't
> > > have changed.
> > >
> > > But hmm, it seems Dave's objection about maintaining the lock allocations would
> > > apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> > > issue for TDX because they get rejected if the allocations are not already
> > > there. So complexity of a per-2MB lock should be minimal, at least
> > > incrementally. The difference seems more about memory use vs performance.
> > >
> > > What gives me pause is in the KVM TDX work we have really tried hard to not take
> > > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> > > hard numbers.
> >
> > Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
> > for write is problematic.  Even if TDX VMs don't exhibit the same patterns *today*
> > as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
> > that will always hold true.
> >
> > > But an enormous amount of work went into letting KVM faults happen under the
> > > shared lock for normal VMs. So on one hand, yes it's premature optimization.
> > > But on the other hand, it's a maintainability concern about polluting the
> > > existing way things work in KVM with special TDX properties.
> > >
> > > I think we need to at least call out loudly that the decision was to go with the
> > > simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> > > opinion is, but I wouldn't want him to first learn of it when he went digging
> > > and found a buried global spin lock in the fault path.
> >
> > Heh, too late, I saw it when this was first posted.  And to be honest, my initial
> > reaction was very much "absolutely not" (though Rated R, not PG).  Now that I've
> > had time to think things through, I'm not _totally_ opposed to having a spinlock
> > in the page fault path, but my overall sentiment remains the same.
> >
> > For mmu_lock and related SPTE operations, I was super adamant about not taking
> > exclusive locks because based on our experience with the TDP MMU, converting flows
> > from exclusive to shared is usually significantly more work than developing code
> > for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> > And importantly, those code paths were largely solved problems.  I.e. I didn't
> > want to get into a situation where TDX undid the parallelization of the TDP MMU,
> > and then had to add it back after the fact.
> >
> > I think the same holds true here.  I'm not completely opposed to introducing a
> > spinlock, but I want to either have a very high level of confidence that the lock
> > won't introduce jitter/delay (I have low confidence on this front, at least in
> > the proposed patches), or have super clear line of sight to making the contention
> > irrelevant, without having to rip apart the code.
> >
> > My biggest question at this point is: why is all of this being done on-demand?
> > IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all PAMT_4K
> > pages at the last possible moment".  Neither of those seems ideal.
> >
> > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages, on-demand
> > PAMT management seems reasonable.  But for PAMTs that are used to track guest-assigned
> > memory, which is the vaaast majority of PAMT memory, why not hook guest_memfd?
> 
> This seems fine for 4K page backing. But when TDX VMs have huge page
> backing, the vast majority of private memory wouldn't need PAMT
> allocation at 4K granularity.
> 
> IIUC, guest_memfd allocation happening at 2M granularity doesn't
> necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
> support is to be properly utilized for huge page backing, there is
> value in not tying PAMT allocation to guest_memfd allocation.

Right.

It also requires special handling in many places in core-mm. Like, what
happens if a THP in guest memfd gets split? Who would allocate PAMT for it?
Migration will be more complicated too (when we get there).

-- 
Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 4 weeks ago
On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages,
> > > on-demand
> > > PAMT management seems reasonable.  But for PAMTs that are used to track
> > > guest-assigned
> > > memory, which is the vaaast majority of PAMT memory, why not hook
> > > guest_memfd?
> > 
> > This seems fine for 4K page backing. But when TDX VMs have huge page
> > backing, the vast majority of private memory wouldn't need PAMT
> > allocation at 4K granularity.
> > 
> > IIUC, guest_memfd allocation happening at 2M granularity doesn't
> > necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
> > support is to be properly utilized for huge page backing, there is
> > value in not tying PAMT allocation to guest_memfd allocation.
> 
> Right.
> 
> > It also requires special handling in many places in core-mm. Like, what
> > happens if a THP in guest memfd gets split? Who would allocate PAMT for it?
> Migration will be more complicated too (when we get there).

I actually went down this path too, but the problem I hit was that TDX module
wants the PAMT page size to match the S-EPT page size. And the S-EPT size will
depend on runtime behavior of the guest. I'm not sure why TDX module requires
this though. Kirill, I'd be curious to understand the constraint more if you
recall.

But in any case, it seems there are multiple reasons.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Kiryl Shutsemau 1 month, 4 weeks ago
On Tue, Aug 12, 2025 at 03:12:52PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages,
> > > > on-demand
> > > > PAMT management seems reasonable.  But for PAMTs that are used to track
> > > > guest-assigned
> > > > memory, which is the vaaast majority of PAMT memory, why not hook
> > > > guest_memfd?
> > > 
> > > This seems fine for 4K page backing. But when TDX VMs have huge page
> > > backing, the vast majority of private memory wouldn't need PAMT
> > > allocation at 4K granularity.
> > > 
> > > IIUC, guest_memfd allocation happening at 2M granularity doesn't
> > > necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
> > > support is to be properly utilized for huge page backing, there is
> > > value in not tying PAMT allocation to guest_memfd allocation.
> > 
> > Right.
> > 
> > > It also requires special handling in many places in core-mm. Like, what
> > > happens if a THP in guest memfd gets split? Who would allocate PAMT for it?
> > Migration will be more complicated too (when we get there).
> 
> I actually went down this path too, but the problem I hit was that TDX module
> wants the PAMT page size to match the S-EPT page size. And the S-EPT size will
> depend on runtime behavior of the guest. I'm not sure why TDX module requires
> this though. Kirill, I'd be curious to understand the constraint more if you
> recall.

With DPAMT, when you pass a page pair to PAMT.ADD, it will be stored in the
PAMT_2M entry. So the PAMT_2M entry cannot be used as a leaf entry anymore.

In theory, the TDX module could stash them somewhere else, like a generic
memory pool to be used for PAMT_4K when needed. But that is a significantly
different design from what we have now, with a different set of problems.
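
Purely as a conceptual picture (the real PAMT layout is internal to the
TDX module and not architectural), the constraint could be sketched as the
PAMT_2M entry doubling as the anchor for the dynamically added 4K-level
pages:

/* Conceptual only; not the TDX module's actual metadata layout. */
struct pamt_2m_entry {
	unsigned long pamt_4k_pair[2];	/* HPAs of the page pair recorded by
					 * TDH.PHYMEM.PAMT.ADD */
	/* ... whatever per-2M metadata the TDX module tracks ... */
};

Once the entry is repurposed to anchor the 4K-level backing, it presumably
cannot describe a 2M leaf mapping at the same time, which would explain why
the PAMT and S-EPT page sizes have to agree.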

-- 
Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Sean Christopherson 1 month, 4 weeks ago
On Tue, Aug 12, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT
> > > > pages, on-demand PAMT management seems reasonable.  But for PAMTs that
> > > > are used to track guest-assigned memory, which is the vaaast majority
> > > > of PAMT memory, why not hook guest_memfd?
> > > 
> > > This seems fine for 4K page backing. But when TDX VMs have huge page
> > > backing, the vast majority of private memory wouldn't need PAMT
> > > allocation at 4K granularity.
> > > 
> > > IIUC, guest_memfd allocation happening at 2M granularity doesn't
> > > necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
> > > support is to be properly utilized for huge page backing, there is
> > > value in not tying PAMT allocation to guest_memfd allocation.

I don't disagree, but the host needs to plan for the worst, especially since the
guest can effectively dictate the max page size of S-EPT mappings.  AFAIK, there
are no plans to support memory overcommit for TDX guests, so unless a deployment
wants to roll the dice and hope TDX guests will use hugepages for N% of their
memory, the host will want to reserve 0.4% of guest memory for PAMTs to ensure
it doesn't unintentionally DoS the guest with an OOM condition.

Ditto for any use case that wants to support dirty logging (ugh), because dirty
logging will require demoting all of guest memory to 4KiB mappings.

> > Right.
> > 
> > It also requires special handling in many places in core-mm. Like, what
> > happens if a THP in guest memfd gets split? Who would allocate PAMT for it?

guest_memfd?  I don't see why core-mm would need to get involved.  And I definitely
don't see how handling page splits in guest_memfd would be more complicated than
handling them in KVM's MMU.

> > Migration will be more complicated too (when we get there).

Which type of migration?  Live migration or page migration?

> I actually went down this path too, but the problem I hit was that TDX module
> wants the PAMT page size to match the S-EPT page size. 

Right, but over-populating the PAMT would just result in "wasted" memory, correct?
I.e. KVM can always provide more PAMT entries than are needed.  Or am I
misunderstanding how dynamic PAMT works?

In other words, IMO, reclaiming PAMT pages on-demand is also a premature optimization
of sorts, as it's not obvious to me that the host would actually be able to take
advantage of the unused memory.

> And the S-EPT size will depend on runtime behavior of the guest. I'm not sure
> why TDX module requires this though. Kirill, I'd be curious to understand the
> constraint more if you recall.
> 
> But in any case, it seems there are multiple reasons.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Kiryl Shutsemau 1 month, 4 weeks ago
On Tue, Aug 12, 2025 at 09:15:16AM -0700, Sean Christopherson wrote:
> On Tue, Aug 12, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT
> > > > > pages, on-demand PAMT management seems reasonable.  But for PAMTs that
> > > > > are used to track guest-assigned memory, which is the vaaast majority
> > > > > of PAMT memory, why not hook guest_memfd?
> > > > 
> > > > This seems fine for 4K page backing. But when TDX VMs have huge page
> > > > backing, the vast majority of private memory wouldn't need PAMT
> > > > allocation at 4K granularity.
> > > > 
> > > > IIUC, guest_memfd allocation happening at 2M granularity doesn't
> > > > necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
> > > > support is to be properly utilized for huge page backing, there is
> > > > value in not tying PAMT allocation to guest_memfd allocation.
> 
> I don't disagree, but the host needs to plan for the worst, especially since the
> guest can effectively dictate the max page size of S-EPT mappings.  AFAIK, there
> are no plans to support memory overcommit for TDX guests, so unless a deployment
> wants to roll the dice and hope TDX guests will use hugepages for N% of their
> memory, the host will want to reserve 0.4% of guest memory for PAMTs to ensure
> it doesn't unintentionally DoS the guest with an OOM condition.
> 
> Ditto for any use case that wants to support dirty logging (ugh), because dirty
> logging will require demoting all of guest memory to 4KiB mappings.
> 
> > > Right.
> > > 
> > > It also requires special handling in many places in core-mm. Like, what
> > > happens if a THP in guest memfd gets split? Who would allocate PAMT for it?
> 
> guest_memfd?  I don't see why core-mm would need to get involved.  And I definitely
> don't see how handling page splits in guest_memfd would be more complicated than
> handling them in KVM's MMU.
> 
> > > Migration will be more complicated too (when we get there).
> 
> Which type of migration?  Live migration or page migration?

Page migration.

But I think after some reading, it can be manageable.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Vishal Annapurve 1 month, 4 weeks ago
On Tue, Aug 12, 2025 at 9:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Aug 12, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-08-12 at 09:04 +0100, kas@kernel.org wrote:
> > > > > E.g. for things like TDCS pages and to some extent non-leaf S-EPT
> > > > > pages, on-demand PAMT management seems reasonable.  But for PAMTs that
> > > > > are used to track guest-assigned memory, which is the vaaast majority
> > > > > of PAMT memory, why not hook guest_memfd?
> > > >
> > > > This seems fine for 4K page backing. But when TDX VMs have huge page
> > > > backing, the vast majority of private memory wouldn't need PAMT
> > > > allocation at 4K granularity.
> > > > 
> > > > IIUC, guest_memfd allocation happening at 2M granularity doesn't
> > > > necessarily translate to 2M mappings in guest EPT entries. If the DPAMT
> > > > support is to be properly utilized for huge page backing, there is
> > > > value in not tying PAMT allocation to guest_memfd allocation.
>
> I don't disagree, but the host needs to plan for the worst, especially since the
> guest can effectively dictate the max page size of S-EPT mappings.  AFAIK, there
> are no plans to support memory overcommit for TDX guests, so unless a deployment
> wants to roll the dice and hope TDX guests will use hugepages for N% of their
> memory, the host will want to reserve 0.4% of guest memory for PAMTs to ensure
> it doesn't unintentionally DoS the guest with an OOM condition.

Reasonable guest VMs (e.g. Linux) will generally map things mostly at
hugepage granularity, so I don't think there is a reason here to be more
conservative and just increase the cost for the well-behaved guests.
That being said, the scenario of an unreasonable guest could be
covered in the future by modifying how PAMT allocation is
accounted/charged.

Guests are generally free to use the lazy pvalidate/accept features, so
the host can't guarantee that the needed PAMT memory will always be
there anyway.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 4 weeks ago
On Tue, 2025-08-12 at 09:15 -0700, Sean Christopherson wrote:
> > I actually went down this path too, but the problem I hit was that TDX
> > module wants the PAMT page size to match the S-EPT page size. 
> 
> Right, but over-populating the PAMT would just result in "wasted" memory,
> correct? I.e. KVM can always provide more PAMT entries than are needed.  Or am
> I misunderstanding how dynamic PAMT works?

Demote needs DPAMT pages in order to split the DPAMT. But "needs" is what I was
hoping to understand better.

I do think though, that we should consider premature optimization vs re-
architecting DPAMT only for the sake of a short term KVM design. As in, if fault
path managed DPAMT is better for the whole lazy accept way of things, it
probably makes more sense to just do it upfront with the existing architecture.

BTW, I think I untangled the fault path DPAMT page allocation code in this
series. I basically moved the existing external page cache allocation to
kvm/vmx/tdx.c. So the details of the top up and external page table cache
happen outside of x86 mmu code. The top-up structure comes from the arch/x86 side
of tdx code, so the cache can just be passed into tdx_pamt_get(). And from the
MMU code's perspective there is just one type "external page tables". It doesn't
know about DPAMT at all.
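
A minimal sketch of that shape (hypothetical signatures; a toy page cache
stands in for KVM's memory caches, and a printf stands in for the SEAMCALL;
not the actual series): the caller tops up the cache outside the fault path,
and tdx_pamt_get() consumes pages from it only when a 2M region's refcount
goes from zero to one:

#include <stdio.h>

struct page_cache { void *pages[8]; int nr; };
struct pamt_2m_state { int refcount; };

/* Stand-in for the TDH.PHYMEM.PAMT.ADD SEAMCALL. */
static void tdh_phymem_pamt_add(unsigned long hpa_2m, void *pg0, void *pg1)
{
	printf("PAMT.ADD %#lx with pair %p/%p\n", hpa_2m, pg0, pg1);
}

/* Consumes two pre-allocated pages only for the first user of a 2M region. */
static int tdx_pamt_get(struct pamt_2m_state *s, unsigned long hpa_2m,
			struct page_cache *cache)
{
	void *pg0, *pg1;

	/* Real code serializes here; the lock granularity is the open question. */
	if (s->refcount++ > 0)
		return 0;			/* PAMT_4K already present */

	if (cache->nr < 2) {
		s->refcount--;
		return -1;			/* cache was not topped up enough */
	}

	pg1 = cache->pages[--cache->nr];
	pg0 = cache->pages[--cache->nr];
	tdh_phymem_pamt_add(hpa_2m, pg0, pg1);
	return 0;
}

int main(void)
{
	static char p0[4096], p1[4096];
	struct page_cache cache = { .pages = { p0, p1 }, .nr = 2 };
	struct pamt_2m_state state = { 0 };

	tdx_pamt_get(&state, 0x200000, &cache);	/* first user: PAMT.ADD */
	tdx_pamt_get(&state, 0x200000, &cache);	/* later users: refcount only */
	return 0;
}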

So if that ends up acceptable, I think the main problem left is just this global
lock. And it seems we have a simple solution for it if needed.

> 
> In other words, IMO, reclaiming PAMT pages on-demand is also a premature
> optimization of sorts, as it's not obvious to me that the host would actually
> be able to take advantage of the unused memory.

I was imagining some guestmemfd callback to set up DPAMT backing for all the
private memory. Just leave it in place when memory is shared, for simplicity.
Then clean up DPAMT when the pages are freed from guestmemfd. The control pages
could have their own path like they do in this series. But it doesn't seem
supported.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Vishal Annapurve 1 month, 4 weeks ago
> On Tue, Aug 12, 2025 at 11:39 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-08-12 at 09:15 -0700, Sean Christopherson wrote:
> > > I actually went down this path too, but the problem I hit was that TDX
> > > module wants the PAMT page size to match the S-EPT page size.
> >
> > Right, but over-populating the PAMT would just result in "wasted" memory,
> > correct? I.e. KVM can always provide more PAMT entries than are needed.  Or am
> > I misunderstanding how dynamic PAMT works?
>
> Demote needs DPAMT pages in order to split the DPAMT. But "needs" is what I was
> hoping to understand better.
>
> I do think though, that we should consider premature optimization vs re-
> architecting DPAMT only for the sake of a short term KVM design. As in, if fault
> path managed DPAMT is better for the whole lazy accept way of things, it
> probably makes more sense to just do it upfront with the existing architecture.
>
> BTW, I think I untangled the fault path DPAMT page allocation code in this
> series. I basically moved the existing external page cache allocation to
> kvm/vmx/tdx.c. So the details of the top up and external page table cache
> happen outside of x86 mmu code. The top-up structure comes from the arch/x86 side
> of tdx code, so the cache can just be passed into tdx_pamt_get(). And from the
> MMU code's perspective there is just one type "external page tables". It doesn't
> know about DPAMT at all.
>
> So if that ends up acceptable, I think the main problem left is just this global
> lock. And it seems we have a simple solution for it if needed.
>
> >
> > In other words, IMO, reclaiming PAMT pages on-demand is also a premature
> > optimization of sorts, as it's not obvious to me that the host would actually
> > be able to take advantage of the unused memory.
>
> I was imagining some guestmemfd callback to set up DPAMT backing for all the
> private memory. Just leave it in place when memory is shared, for simplicity.
> Then clean up DPAMT when the pages are freed from guestmemfd. The control pages
> could have their own path like they do in this series. But it doesn't seem
> supported.

IMO, tying the lifetime of guest_memfd folios to that of KVM ownership
beyond the memslot lifetime is leaking more state into guest_memfd
than needed. E.g., this will prevent use cases where guest_memfd needs
to be reused while handling reboot of a confidential VM [1].

IMO, if avoidable, it's better not to have DPAMT or, generally, other KVM
arch-specific state tracking hooked up to guest_memfd folios, especially
with hugepage support and the whole-folio splitting/merging that needs to
be done. If you still need it, guest_memfd should be as stateless as
possible, just like we are pushing for SNP preparation tracking [2]
to happen within KVM SNP, and IMO any such tracking should ideally be
cleaned up on memslot unbinding.

[1] https://lore.kernel.org/kvm/CAGtprH9NbCPSwZrQAUzFw=4rZPA60QBM2G8opYo9CZxRiYihzg@mail.gmail.com/
[2] https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 4 weeks ago
On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote:
> IMO, tying the lifetime of guest_memfd folios to that of KVM ownership
> beyond the memslot lifetime is leaking more state into guest_memfd
> than needed. E.g., this will prevent use cases where guest_memfd needs
> to be reused while handling reboot of a confidential VM [1].

How does it prevent this? If you really want to re-use guest memory in a fast
way then I think you would want the DPAMT to remain in place actually. It sounds
like an argument to trigger the add/remove from guestmemfd actually.

But I really question, with all the work to rebuild S-EPT, and if you propose
DPAMT too, how much is really gained by not needing to reallocate hugetlbfs
pages. Do you see how it could be surprising? I'm currently assuming there is
some missing context.

> 
> IMO, if avoidable, it's better not to have DPAMT or, generally, other KVM
> arch-specific state tracking hooked up to guest_memfd folios, especially
> with hugepage support and the whole-folio splitting/merging that needs to
> be done. If you still need it, guest_memfd should be as stateless as
> possible, just like we are pushing for SNP preparation tracking [2]
> to happen within KVM SNP, and IMO any such tracking should ideally be
> cleaned up on memslot unbinding.

I'm not sure gmemfd would need to track state. It could be a callback. But it
may be academic anyway. Below...

> 
> [1] https://lore.kernel.org/kvm/CAGtprH9NbCPSwZrQAUzFw=4rZPA60QBM2G8opYo9CZxRiYihzg@mail.gmail.com/
> [2] https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/

Looking into that more, from the code it seems it's not quite so
straightforward. Demote will always require new DPAMT pages to be passed, and
promote will always remove the 4k DPAMT entries and pass them back to the host.
But on early reading, 2MB PAGE.AUG looks like it can handle the DPAMT being
mapped at 4k. So maybe there is some wiggle room there? But before I dig
further, I think I've heard 4 possible arguments for keeping the existing
design:

1. TDX module may require or at least push the caller to have S-EPT match DPAMT
size (confirmation TBD)
2. Mapping DPAMT all at 4k requires extra memory for TD huge pages
3. It *may* slow TD boots because things can't be lazily installed via the fault
path. (testing not done)
4. While the global lock is bad, there is an easy fix for that if it is needed.

It seems Vishal cares a lot about (2). So I'm wondering if we need to keep going
down this path.

In the meantime, I'm going to try to get some better data on the global lock
contention (Sean's question about how much of the memory was actually faulted
in).
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Vishal Annapurve 1 month, 4 weeks ago
On Tue, Aug 12, 2025 at 4:35 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote:
> > IMO, tying the lifetime of guest_memfd folios to that of KVM ownership
> > beyond the memslot lifetime is leaking more state into guest_memfd
> > than needed. E.g., this will prevent use cases where guest_memfd needs
> > to be reused while handling reboot of a confidential VM [1].
>
> How does it prevent this? If you really want to re-use guest memory in a fast
> way then I think you would want the DPAMT to remain in place actually. It sounds
> like an argument to trigger the add/remove from guestmemfd actually.

With the reboot use case, a new VM may start with its own new HKID, so
I don't think we can preserve any state that's specific to the
previous instance. We can only reduce the amount of state to be
maintained in SEPTs/DPAMTs by using hugepages wherever possible.

>
> But I really question, with all the work to rebuild S-EPT, and if you propose
> DPAMT too, how much is really gained by not needing to reallocate hugetlbfs
> pages. Do you see how it could be surprising? I'm currently assuming there is
> some missing context.

I would not limit the reboot use case to just the hugepage backing
scenario. guest_memfd folios (and ideally the guest_memfd files
themselves) should simply be reusable outside the VM lifecycle,
irrespective of whether they are used to back CoCo VMs or not.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 1 month, 4 weeks ago
On Tue, 2025-08-12 at 17:18 -0700, Vishal Annapurve wrote:
> On Tue, Aug 12, 2025 at 4:35 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote:
> > > IMO, tying the lifetime of guest_memfd folios to that of KVM ownership
> > > beyond the memslot lifetime is leaking more state into guest_memfd
> > > than needed. E.g., this will prevent use cases where guest_memfd needs
> > > to be reused while handling reboot of a confidential VM [1].
> > 
> > How does it prevent this? If you really want to re-use guest memory in a fast
> > way then I think you would want the DPAMT to remain in place actually. It sounds
> > like an argument to trigger the add/remove from guestmemfd actually.
> 
> With the reboot use case, a new VM may start with its own new HKID, so
> I don't think we can preserve any state that's specific to the
> previous instance. We can only reduce the amount of state to be
> maintained in SEPTs/DPAMTs by using hugepages wherever possible.

This is saying that the S-EPT can't be preserved, which doesn't really help me
understand why page allocations are such a big source of the work. I guess you
are saying it's the only thing possible to do?

Hmm, I'm not sure why the S-EPT couldn't be preserved, especially when you allow
for changes to KVM or the TDX module.

But if we are trying to solve the problem of making TD reboot faster, let's
first figure out the biggest things that are making it slow and work on those.
Right now a lot of context is missing on why this turned out to be the right
optimization to do.

Disclaimer: This optimization may be great for other types of VMs and that is
all well and good, but the points are about TDX here and the justification of
the TD reboot optimization is relevant to how we implement DPAMT.

> 
> > 
> > But I really question, with all the work to rebuild S-EPT, and if you propose
> > DPAMT too, how much is really gained by not needing to reallocate hugetlbfs
> > pages. Do you see how it could be surprising? I'm currently assuming there is
> > some missing context.
> 
> I would not limit the reboot use case to just the hugepage backing
> scenario. guest_memfd folios (and ideally the guest_memfd files
> themselves) should simply be reusable outside the VM lifecycle,
> irrespective of whether they are used to back CoCo VMs or not.

Still surprised that host page allocations turned out to be the biggest thing
sticking out.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Edgecombe, Rick P 3 months, 2 weeks ago
On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> This patchset enables Dynamic PAMT in TDX. Please review.
> 
> Previously, we thought it can get upstreamed after huge page support, but
> huge pages require support on guestmemfd side which might take time to hit
> upstream. Dynamic PAMT doesn't have dependencies.

Did you run this through the latest TDX selftests? Specifically Reinette's WIP
MMU stress test would be really good to run on this.
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by kirill.shutemov@linux.intel.com 3 months, 2 weeks ago
On Wed, Jun 25, 2025 at 10:49:16PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> > This patchset enables Dynamic PAMT in TDX. Please review.
> > 
> > Previously, we thought it can get upstreamed after huge page support, but
> > huge pages require support on guestmemfd side which might take time to hit
> > upstream. Dynamic PAMT doesn't have dependencies.
> 
> Did you run this through the latest TDX selftests? Specifically Reinette's WIP
> MMU stress test would be really good to run on this.

I didn't. Will do.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT
Posted by Kirill A. Shutemov 3 months, 2 weeks ago
On Mon, Jun 09, 2025 at 10:13:28PM +0300, Kirill A. Shutemov wrote:
> This patchset enables Dynamic PAMT in TDX. Please review.

Gentle ping?

The patchset is crucial for getting TDX adopted. The upfront memory cost of
enabling TDX for the machine is too high, especially if you don't know in
advance if the machine will run TDX guests.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov