Hi,

This is the 3rd revision of Dynamic PAMT, which is a new feature that
reduces memory use of TDX.

On v2 (as well as in PUCK) there was some discussion of the
refcount/locking design tradeoffs for Dynamic PAMT. For this v3, I’ve
basically gone through and tried to make the details around this more
reviewable. The basic solution is the same as v2, with the changes more
about moving code around or splitting implementations/optimizations. I’m
hoping with this v3 we can close on whether that approach is good enough
or not.

I think the patch quality is in ok shape, but it still needs some review.
Maintainers, please feel free to let us go through this v3 for lower level
code issues, but I would appreciate engagement on the overall design.

Another open item is performance testing, beyond the bit about excluding
contention of the global lock.

Lastly, Yan raised some last minute doubts internally about TDX module
locking contention. I’m not sure there is a problem, but we can come to an
agreement as part of the review.

PAMT Background
===============

The TDX module needs to keep data about each physical page it uses. It
requires the kernel to give it memory to use for this purpose, called
PAMT. Internally it wants space for metadata for each page *and* each page
size. That is, if a page is mapped at 2MB in a TD, it doesn’t spread this
tracking across the allocations it uses for 4KB page size usage of the
same physical memory; it is designed to use a separate allocation for
this. So each memory region that the TDX module could use (aka a TDMR) has
three of these PAMT allocations. They are all allocated during the global
TDX initialization, regardless of whether the memory is actually getting
used for a TD. This uses up approximately 0.4% of system memory.

Dynamic PAMT (DPAMT)
====================

Fortunately, only using physical memory for areas of an address space that
are actually in use is a familiar problem in systems engineering, with a
well trodden solution: page tables.
It would be great if TDX could do something like that for PAMT. This is
basically the idea behind Dynamic PAMT. However, there are some design
aspects that could be surprising for anyone expecting “PAMT, but page
tables”. The following describes these differences.

DPAMT Levels
------------

Dynamic PAMT focuses on the page size level that has the biggest PAMT
allocation - 4KB page size. Since the 2MB and 1GB levels are smaller
allocations, they are left as fixed arrays, as in normal PAMT. But the 4KB
page size is not fully dynamic either: the TDX module still requires a
physically contiguous space for tracking each 4KB page in a TDMR. This
space shrinks significantly though, to currently 1 bit per page.

Page Sizes
----------

Like normal PAMT, Dynamic PAMT wants to provide a way for the TDX module
to have separate PAMT tracking for different page sizes. But unlike normal
PAMT, it does not seamlessly switch between the 2MB and 4KB page sizes
without VMM action. It wants the PAMT mapped at the same level that the
underlying TDX memory is using. In practice this means the VMM needs to
update the PAMT depending on whether secure EPT pages are to be mapped at
2MB or 4KB. While demote/promote have internal handling for these updates
(and support passing or returning the PAMT pages involved), PAGE.ADD/AUG
don’t. Instead two new SEAMCALLs are provided for the VMM to configure the
PAMT to the intended page size (i.e. 4KB if the page will be mapped at
4KB): TDH.PHYMEM.PAMT.ADD/REMOVE.

While some operations on TD memory can internally handle the PAMT
adjustments, the opposite is not true. That is, changes in PAMT don’t try
to automatically change private S-EPT page sizes. Instead, an attempt to
remove 4KB size PAMT pages will fail if any pages in the covered range are
in use.

Concurrency
-----------

For every 2MB physical range there could be many 4KB pages used by TDX
(obviously). But each of those only needs one set of PAMT pages added.
So on the first use of the 2MB region the DPAMT needs to be installed, and
after none of the pages in that range are in use, it needs to be freed.
The TDX module actually does track how many pages are using each 2MB range
and gives hints when the refcount reaches zero. But it is only incremented
when use of the 2MB region actually starts. Like:

1. TDH.PHYMEM.PAMT.ADD (backing for 2MB range X, for page Y)
2. TDH.MEM.PAGE.AUG (page Y)        <- Increments refcount for X.
3. TDH.MEM.PAGE.REMOVE (page Y)     <- Decrements refcount for X, gives
                                       hint in return value.
4. TDH.PHYMEM.PAMT.REMOVE (range X) <- Remove, check that each 4KB page
                                       in X is free.

The internal refcount is tricky to use because of the window of time
between TDH.PHYMEM.PAMT.ADD and TDH.MEM.PAGE.AUG. The PAMT.ADD adds the
backing, but doesn’t tell the TDX module that a VMM intends to map it.
Consider the range X that includes pages Y and Z, for an implementation
that tries to use these hints:

  CPU 0                           CPU 1
  TDH.PHYMEM.PAMT.ADD X
    (returns already mapped)
                                  TDH.MEM.PAGE.REMOVE (Y)
                                    (returns refcount 0 hint)
                                  TDH.PHYMEM.PAMT.REMOVE (X)
  TDH.MEM.PAGE.AUG Z (fail)

So the TDX module’s DPAMT refcounts don't track what the VMM intends to do
with the page, only what it has already done. This leaves a window that
needs to be covered.

TDX module locking
------------------

Inside the TDX module there are various locks. The TDX module does not
wait when it encounters contention; instead it returns a BUSY error code.
This leaves the VMM with the option to either loop around the SEAMCALL, or
return an error to the calling code. In some cases in Linux there is not
an option to return an error from the operation making the SEAMCALL. To
avoid a potentially indeterminate number of retries, KVM opts to kick all
the threads associated with the TD out of the TDX module and retry. This
retry operation is fairly easy to do from within KVM. Since PAMT is a
global resource, this lock contention could come from any TD.
For normal PAMT, exclusive locks are only taken at the 4KB page size
granularity. In practice, this means any page that is not shared between
TDs won’t have to worry about contention. However, for DPAMT this changes.
The TDH.PHYMEM.PAMT.ADD/REMOVE calls take a PAMT lock at 2MB granularity.
If two calls try to operate on the same 2MB region at the same time, one
will get the BUSY error code.

Linux Challenges
================

So for Dynamic PAMT, Linux needs to:
1. Allocate a different fixed sized allocation for each TDMR, for the 4KB
   page size (the 1 bit per page bitmap instead of the normal larger
   allocation).
2. Allocate DPAMT for control structures.
3. Allocate DPAMT 4KB page backings for all TD private memory (which is
   currently only 4KB) and S-EPT page tables.

1 is easy: just query the new size when DPAMT is in use and use that
instead of the regular 4KB PAMT size. The bitmap allocation is even passed
in the same field to the TDX module. If you take away the TDX docs' naming
around the bitmap, it’s just a buffer that changes size.

For 2 and 3, there is more to consider. 2 is relatively easy as long as we
want to install PAMT on demand, since these pages come straight from the
page allocator and not guestmemfd. For 3, upstream we currently only have
4KB pages, which means we could ignore a lot of the specifics about
matching page sizes - there is only one. However, TDX huge pages is also
in progress, so we should avoid a design that would need to be redone
immediately.

Installing 4KB DPAMT backings
-----------------------------

Some TDX pages are used for TD control structures. These need new code to
install 4KB DPAMT backings. But the main problem is how to do this for
private memory. Linux needs to add the DPAMT backing for private memory
before the page gets mapped in the TD. Doing so inside the KVM MMU call
paths adds complications around the BUSY error code, as described above.
It would be tempting to install DPAMT pages from a guestmemfd callback.
This could happen outside the KVM MMU locks. But there are three
complications with this.

One is that in the case of 2MB pages, the guest can control the page size.
This means, even if 4KB page sizes are installed automatically, KVM would
have to handle edge cases of PAMT adjustments at runtime anyway. For
example, memslot deletion and re-adding would trigger a zap of huge pages
that are later remapped at 4KB. This could *maybe* be worked around by
some variant of this technique [0].

Another wrinkle is that Vishal@google has expressed a strong interest in
saving PAMT memory at runtime in the case of 2MB TD private memory. He
wants to support a use case where most TD memory is mapped at 2MB, so he
wants to avoid the overhead of a worst case allocation that assumes all
memory will be mapped at 4KB.

Finally, pre-installing DPAMT pages before the fault doesn’t help with
mapping DPAMT pages for the external (S-EPT) page tables that are
allocated for the fault. So some fault time logic is needed. We could
pre-install DPAMT backing for the external page table cache, which would
happen outside of the MMU lock. This would free us from having to update
DPAMT inside the MMU lock, but it would not free KVM from having to do
anything around DPAMT during a fault.

These three non-show-stopping issues tilt things towards using a fault
time DPAMT installation approach for private memory.

Avoiding duplicate attempts to add/remove DPAMT
-----------------------------------------------

As covered above, there isn’t a refcount in the TDX module that we can
use. Using the hints returned by TDH.MEM.PAGE.AUG/REMOVE was tried in v1,
and Kirill was unable to both overcome the races and make nice with the
new failure scenarios around DPAMT locks. So in v2 refcounts were
allocated on the kernel side for each 2MB range covered by a TDMR (a range
the TDX module might use).
This adds a small memory overhead of 0.0002%, which is small compared to
reducing the 0.4% PAMT overhead to 0.004%. The major downside is code
complexity. These allocations are still large and involve managing vmalloc
space. The v2 solution reserves vmalloc space to cover the entire physical
address space, and only maps pages for ranges that are covered by a TDMR.

Avoiding contention
-------------------

As covered above, Linux needs to deal with getting BUSY error codes here.
The allocation/mapping paths for all these pages can already handle
failure, but the trickier case is the removal paths. As previously sparred
with during the base series, the TDX module expects to be able to fail
these calls, but the kernel does not. Further, the refcounts can not act
as a race free lock on their own. So some synchronization is needed before
actually removing the DPAMT backings.

V2 of the series includes a global lock to be used around actual
installation/removal of the DPAMT backing, combined with opportunistic
checking outside the lock to avoid taking it most of the time. In testing,
booting 10 16GB TDs, the lock only hit contention 1136 times, with 4ms of
waiting. This is very small for an operation that took 60s of wall time.
So despite being an (ugly) global lock, the actual impact was small. It
will probably be further reduced in the case of huge pages, where most of
the time 4KB DPAMT installation will not be necessary.

Updates in v3
=============

Besides incorporating the feedback and general cleanups, the major design
change was around how DPAMT backing pages are allocated in the fault path.

Allocating DPAMT pages in v2
----------------------------

In v2, there was a specific page cache added in the generic x86 KVM MMU
for DPAMT pages. This was needed because much of the fault handling
happens inside a spinlock. By the time the fault handler knows whether it
needs to install DPAMT backing, it can no longer allocate pages.
This is a common pattern in the KVM MMU, and so pages are pre-allocated
before taking the MMU spinlock. This way they can be used later if needed.
But KVM’s page cache infrastructure is not easy to pass into the arch/x86
code and inside the global spinlock where the pages would be consumed. So
instead it passed a pointer to a function inside KVM that could be called
to allocate pages from the KVM page cache component. Since the control
structures need DPAMT backing installed outside of the fault, the arch/x86
code also had logic to allocate pages directly from the page allocator.
Further, there were various resulting intermediate lists that had to be
marshaled through the DPAMT allocation paths. This was all a bit
complicated to me.

Updates in v3
-------------

V3 redid the areas described above to try to simplify them. The KVM MMU
already has knowledge that TDX needs special memory allocated for S-EPT.
From the fault handler's perspective, this could be seen as just more
memory of the same type. So v3 just turns the external page table
allocation into an x86 op, and provides another op to allocate from it.
This is done as refactoring. Then when Dynamic PAMT is added, the extra
pages can just be added from within the x86 op in TDX code.

To remove the function pointer callback scheme, the external page tables
are switched to a dirt simple linked list based page cache. This is
somewhat reinventing the wheel, but KVM’s kvm_mmu_memory_cache operations
are not easy to expose to the core kernel, and TDX doesn’t need much of
the fanciness of initial values and things like that. Building it out of
the kernel's linked list is enough code reuse.

Today the TDX module needs 2 pages per 2MB region of 4KB size Dynamic PAMT
backing. It would be tempting to just pass two pages in, but the TDX
module exposes the number of Dynamic PAMT pages it needs as a metadata
value. So the size is technically variable.
To handle this, the design just passes in the simple TDX page cache list
in the calls that might need to allocate Dynamic PAMT.

Considerations for v4
=====================

This solution seems workable. It isolates Dynamic PAMT to TDX code, and
doesn’t introduce any extra constraints on generic x86 KVM code. But the
refcounts and global lock on the arch/x86 side of TDX are still ugly.
There has been some internal discussion about pursuing various schemes to
avoid this. But before a potential redesign, I wanted to share the current
version, both to get feedback on the updates, and so we can consider how
“good enough” the current design is.

Testing and branch
==================

Testing is a bit light currently: just TDX selftests and a simple TDX
Linux guest boot. The branch is here:
https://github.com/intel/tdx/commits/dpamt_v3/
Based on kvm_x86/next (603c090664d3)

[0] https://lore.kernel.org/kvm/20250807094423.4644-1-yan.y.zhao@intel.com/

Kirill A. Shutemov (13):
  x86/tdx: Move all TDX error defines into <asm/shared/tdx_errno.h>
  x86/tdx: Add helpers to check return status codes
  x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
  x86/virt/tdx: Allocate reference counters for PAMT memory
  x86/virt/tdx: Improve PAMT refcounters allocation for sparse memory
  x86/virt/tdx: Add tdx_alloc/free_page() helpers
  x86/virt/tdx: Optimize tdx_alloc/free_page() helpers
  KVM: TDX: Allocate PAMT memory for TD control structures
  KVM: TDX: Allocate PAMT memory for vCPU control structures
  KVM: TDX: Handle PAMT allocation in fault path
  KVM: TDX: Reclaim PAMT memory
  x86/virt/tdx: Enable Dynamic PAMT
  Documentation/x86: Add documentation for TDX's Dynamic PAMT

Rick Edgecombe (3):
  x86/virt/tdx: Simplify tdmr_get_pamt_sz()
  KVM: TDX: Add x86 ops for external spt cache
  x86/virt/tdx: Add helpers to allow for pre-allocating pages

 Documentation/arch/x86/tdx.rst              |  21 +
 arch/x86/coco/tdx/tdx.c                     |   6 +-
 arch/x86/include/asm/kvm-x86-ops.h          |   2 +
 arch/x86/include/asm/kvm_host.h             |  11 +-
 arch/x86/include/asm/shared/tdx.h           |   1 +
 arch/x86/include/asm/shared/tdx_errno.h     | 109 +++++
 arch/x86/include/asm/tdx.h                  |  76 ++-
 arch/x86/include/asm/tdx_global_metadata.h  |   1 +
 arch/x86/kvm/mmu/mmu.c                      |   4 +-
 arch/x86/kvm/mmu/mmu_internal.h             |   2 +-
 arch/x86/kvm/vmx/tdx.c                      | 157 ++++--
 arch/x86/kvm/vmx/tdx.h                      |   3 +-
 arch/x86/kvm/vmx/tdx_errno.h                |  40 --
 arch/x86/virt/vmx/tdx/tdx.c                 | 505 +++++++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h                 |   5 +-
 arch/x86/virt/vmx/tdx/tdx_global_metadata.c |   3 +
 16 files changed, 766 insertions(+), 180 deletions(-)
 create mode 100644 arch/x86/include/asm/shared/tdx_errno.h
 delete mode 100644 arch/x86/kvm/vmx/tdx_errno.h

--
2.51.0
On Thu, Sep 18, 2025 at 04:22:08PM -0700, Rick Edgecombe wrote:
> Hi,
>
> This is 3rd revision of Dynamic PAMT, which is a new feature that reduces
> memory use of TDX.
>
> On v2 (as well as in PUCK) there was some discussion of the
> refcount/locking design tradeoffs for Dynamic PAMT. In v2, I’ve basically
> gone through and tried to make the details around this more reviewable.
> The basic solution is the same as v2, with the changes more about moving
> code around or splitting implementations/optimizations. I’m hoping with
> this v3 we can close on whether that approach is good enough or not.
>
> I think the patch quality is in ok shape, but still need some review.
> Maintainers, please feel free to let us go through this v3 for lower level
> code issues, but I would appreciate engagement on the overall design.
>
> Another open still is performance testing, besides the bit about excluding
> contention of the global lock.
>
> Lastly, Yan raised some last minute doubts internally about TDX module
> locking contention. I’m not sure there is a problem, but we can come to an
> agreement as part of the review.

Yes, I found a contention issue that prevents us from dropping the global
lock. I've also written a sample test that demonstrates this contention.

  CPU 0                                 CPU 1
  (a) TDH.PHYMEM.PAMT.ADD(A1, B1, xx1)  (b) TDH.PHYMEM.PAMT.ADD(B2, xx2, xx3)

A1, B2 are not from the same 2MB physical range.
B1, B2 are from the same 2MB physical range.
Physical addresses of xx1, xx2, xx3 are irrelevant.

(a) adds PAMT pages B1, xx1 for A1's 2MB physical range.
(b) adds PAMT pages xx2, xx3 for B2's 2MB physical range.

This could happen when the host wants to AUG a 4KB page A1. So, it
allocates 4KB pages B1, xx1, and invokes (a). Then the host wants to AUG a
4KB page B2. Since there are no PAMT pages for B2's 2MB physical range
yet, the host invokes (b).

(a) needs to hold the shared lock of B1's 2MB PAMT entry.
(b) needs to hold the exclusive lock of B2's 2MB PAMT entry.
On 9/25/25 19:28, Yan Zhao wrote:
>> Lastly, Yan raised some last minute doubts internally about TDX
>> module locking contention. I’m not sure there is a problem, but we
>> can come to an agreement as part of the review.
> Yes, I found a contention issue that prevents us from dropping the
> global lock. I've also written a sample test that demonstrates this
> contention.

But what is the end result when this contention happens? Does everyone
livelock?
On Fri, 2025-09-26 at 07:09 -0700, Dave Hansen wrote:
> On 9/25/25 19:28, Yan Zhao wrote:
> > > Lastly, Yan raised some last minute doubts internally about TDX
> > > module locking contention. I’m not sure there is a problem, but we
> > > can come to an agreement as part of the review.
> > Yes, I found a contention issue that prevents us from dropping the
> > global lock. I've also written a sample test that demonstrates this
> > contention.
> But what is the end result when this contention happens? Does
> everyone livelock?

You get a TDX_OPERAND_BUSY error code returned. Inside the TDX module each
lock is a trylock. The TDX module tries to take a sequence of locks, and
if it meets any contention it will release them all and return
TDX_OPERAND_BUSY. Some paths in KVM cannot handle failure and so don't
have a way to handle the error.

So another option for handling this is just to retry until you succeed.
Then you have a very strange spinlock with a heavyweight inner loop. And
since the locks are all released each time, some astronomically bad timing
might prevent forward progress. On the KVM side we have avoided looping.
Although, I think we have not exhausted this path.
On 9/26/25 09:02, Edgecombe, Rick P wrote:
> On Fri, 2025-09-26 at 07:09 -0700, Dave Hansen wrote:
>> On 9/25/25 19:28, Yan Zhao wrote:
>>>> Lastly, Yan raised some last minute doubts internally about TDX
>>>> module locking contention. I’m not sure there is a problem, but we
>>>> can come to an agreement as part of the review.
>>> Yes, I found a contention issue that prevents us from dropping the
>>> global lock. I've also written a sample test that demonstrates this
>>> contention.
>> But what is the end result when this contention happens? Does
>> everyone livelock?
>
> You get a TDX_OPERAND_BUSY error code returned. Inside the TDX module
> each lock is a try lock. The TDX module tries to take a sequence of
> locks and if it meets any contention it will release them all and
> return the TDX_OPERAND_BUSY. Some paths in KVM cannot handle failure
> and so don't have a way to handle the error.
>
> So another option to handling this is just to retry until you succeed.
> Then you have a very strange spinlock with a heavyweight inner loop.
> But since each time the locks are released, some astronomical bad
> timing might prevent forward progress. On the KVM side we have avoided
> looping. Although, I think we have not exhausted this path.

If it can't return failure then the _only_ other option is to spin. Right?

I understand the reluctance to have such a nasty spin loop. But other than
reworking the KVM code to do the retries at a higher level, is there
another option?
On Fri, 2025-09-26 at 09:11 -0700, Dave Hansen wrote:
> If it can't return failure then the _only_ other option is to spin.
> Right?

Yea, but you could spin around the SEAMCALL or you could spin on duplicate
locks on the kernel side before making the SEAMCALL. Or put more
generally, you could prevent contention before you make the SEAMCALL. KVM
does this also by kicking vCPUs out of the TDX module via IPI in other
cases.

> I understand the reluctance to have such a nasty spin loop. But other
> than reworking the KVM code to do the retries at a higher level,

Re-working KVM code would be tough, although teaching KVM to fail zap
calls has come up before for TDX/gmem interactions. It was looked at and
decided to be too complex. Now I guess the benefit side of the equation
changes a little bit, but doing it only for TDX might still be a bridge
too far.

Unless anyone is holding onto another usage that might want this?

> is there another option?

I don't see why we can't just duplicate the locking in a more matching way
on the kernel side. Before, the plan to someday drop the global lock if
needed was to switch to 2MB granular locks to match the TDX module's
internal exclusive lock behavior.

What Yan is basically pointing out is that there are shared locks that are
also taken on different ranges that could possibly contend with the
exclusive one that we are duplicating on the kernel side.

So the problem is not fundamental to the approach, I think. We just took a
shortcut by ignoring the shared locks. For line-of-sight to a path to
remove the global lock someday, I think we could make the 2MB granular
locks be reader/writer to match the TDX module. Then around the SEAMCALLs
that take these locks, we could take them on the kernel side in the right
order for whichever SEAMCALL we are making.

And that would only be the plan if we wanted to improve scalability
someday without changing the TDX module.
On Sat, Sep 27, 2025 at 03:00:31AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-09-26 at 09:11 -0700, Dave Hansen wrote:
> > If it can't return failure then the _only_ other option is to spin.
> > Right?
>
> Yea, but you could spin around the SEAMCALL or you could spin on
> duplicate locks on the kernel side before making the SEAMCALL. Or put
> more generally, you could prevent contention before you make the
> SEACMALL. KVM does this also by kicking vCPUs out of the TDX module via
> IPI in other cases.
>
> > I understand the reluctance to have such a nasty spin loop. But other
> > than reworking the KVM code to do the retries at a higher level,
>
> Re-working KVM code would be tough, although teaching KVM to fail zap
> calls has come up before for TDX/gmem interactions. It was looked at
> and decided to be too complex. Now I guess the benefit side of the
> equation changes a little bit, but doing it only for TDX might still be
> a bridge to far.
>
> Unless anyone is holding onto another usage that might want this?
>
> > is there another option?
>
> I don't see why we can't just duplicate the locking in a more matching
> way on the kernel side. Before the plan to someday drop the global lock
> if needed, was to switch to 2MB granular locks to match the TDX
> module's exclusive lock internal behavior.
>
> What Yan is basically pointing out is that there are shared locks that
> are also taken on different ranges that could possibly contend with the
> exclusive one that we are duplicating on the kernel side.
>
> So the problem is not fundamental to the approach I think. We just took
> a shortcut by ignoring the shared locks. For line-of-sight to a path to
> remove the global lock someday, I think we could make the 2MB granular
> locks be reader/writer to match the TDX module.

Not sure if that would work.
In the following scenario, where
(a) adds PAMT pages B1, xx1 for A1's 2MB physical range.
(b) adds PAMT pages A2, xx2 for B2's 2MB physical range.

A1, B2 are not from the same 2MB physical range,
A1, A2 are from the same 2MB physical range.
B1, B2 are from the same 2MB physical range.
Physical addresses of xx1, xx2 are irrelevant.

CPU 0                                  CPU 1
---------------------------------      -----------------------------
write_lock(&rwlock-of-range-A1);       write_lock(&rwlock-of-range-B2);
read_lock(&rwlock-of-range-B1);        read_lock(&rwlock-of-range-A2);
...                                    ...
(a) TDH.PHYMEM.PAMT.ADD(A1, B1, xx1)   (b) TDH.PHYMEM.PAMT.ADD(B2, A2, xx2)
...                                    ...
read_unlock(&rwlock-of-range-B1);      read_unlock(&rwlock-of-range-A2);
write_unlock(&rwlock-of-range-A1);     write_unlock(&rwlock-of-range-B2);

To match the reader/writer locks in the TDX module, it looks like we may
encounter an AB-BA lock issue.

Do you have any suggestions for a better approach?

e.g., could the PAMT pages be allocated from a dedicated pool that ensures
they reside in different 2MB ranges from guest private pages and TD
control pages?

> And that would only be the plan if we wanted to improve scalability
> someday without changing the TDX module.
On Sun, Sep 28, 2025 at 09:34:14AM +0800, Yan Zhao wrote:
> On Sat, Sep 27, 2025 at 03:00:31AM +0800, Edgecombe, Rick P wrote:
> > On Fri, 2025-09-26 at 09:11 -0700, Dave Hansen wrote:
> > > If it can't return failure then the _only_ other option is to spin.
> > > Right?
> >
> > Yea, but you could spin around the SEAMCALL or you could spin on
> > duplicate locks on the kernel side before making the SEAMCALL. Or put
> > more generally, you could prevent contention before you make the
> > SEACMALL. KVM does this also by kicking vCPUs out of the TDX module
> > via IPI in other cases.
> >
> > > I understand the reluctance to have such a nasty spin loop. But
> > > other than reworking the KVM code to do the retries at a higher
> > > level,
> >
> > Re-working KVM code would be tough, although teaching KVM to fail zap
> > calls has come up before for TDX/gmem interactions. It was looked at
> > and decided to be too complex. Now I guess the benefit side of the
> > equation changes a little bit, but doing it only for TDX might still
> > be a bridge to far.
> >
> > Unless anyone is holding onto another usage that might want this?
> >
> > > is there another option?
> >
> > I don't see why we can't just duplicate the locking in a more
> > matching way on the kernel side. Before the plan to someday drop the
> > global lock if needed, was to switch to 2MB granular locks to match
> > the TDX module's exclusive lock internal behavior.
> >
> > What Yan is basically pointing out is that there are shared locks
> > that are also taken on different ranges that could possibly contend
> > with the exclusive one that we are duplicating on the kernel side.
> >
> > So the problem is not fundamental to the approach I think. We just
> > took a shortcut by ignoring the shared locks. For line-of-sight to a
> > path to remove the global lock someday, I think we could make the 2MB
> > granular locks be reader/writer to match the TDX module.
> > Then around the SEAMCALLs that take these locks, we could take them
> > on the kernel side in the right order for whichever SEAMCALL we are
> > making.
> Not sure if that would work.
>
> In the following scenario, where
> (a) adds PAMT pages B1, xx1 for A1's 2MB physical range.
> (b) adds PAMT pages A2, xx2 for B2's 2MB physical range.
>
> A1, B2 are not from the same 2MB physical range,
> A1, A2 are from the same 2MB physical range.
> B1, B2 are from the same 2MB physical range.
> Physical addresses of xx1, xx2 are irrelevant.
>
> CPU 0                                  CPU 1
> ---------------------------------      -----------------------------
> write_lock(&rwlock-of-range-A1);       write_lock(&rwlock-of-range-B2);
> read_lock(&rwlock-of-range-B1);        read_lock(&rwlock-of-range-A2);
> ...                                    ...
> (a) TDH.PHYMEM.PAMT.ADD(A1, B1, xx1)   (b) TDH.PHYMEM.PAMT.ADD(B2, A2, xx2)
> ...                                    ...
> read_unlock(&rwlock-of-range-B1);      read_unlock(&rwlock-of-range-A2);
> write_unlock(&rwlock-of-range-A1);     write_unlock(&rwlock-of-range-B2);
>
> To match the reader/writer locks in the TDX module, it looks like we may
> encounter an AB-BA lock issue.
>
> Do you have any suggestions for a better approach?
>
> e.g., could the PAMT pages be allocated from a dedicated pool that
> ensures they reside in different 2MB ranges from guest private pages and
> TD control pages?

It can work: allocate 2MB at a time for PAMT and piecemeal it to the TDX
module as needed. But it means that if the 2MB allocation fails, TDX is
not functional.

Maybe just use a dedicated kmem_cache for PAMT allocations. Although, I am
not sure if there's a way to specify to kmem_cache what pages to ask for
from the page allocator.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 9/29/25 04:17, Kiryl Shutsemau wrote:
>> Do you have any suggestions for a better approach?
>>
>> e.g., could the PAMT pages be allocated from a dedicated pool that
>> ensures they reside in different 2MB ranges from guest private pages
>> and TD control pages?
> It can work: allocate 2M a time for PAMT and piecemeal it to TDX module
> as needed. But it means if 2M allocation is failed, TDX is not
> functional.
>
> Maybe just use a dedicated kmem_cache for PAMT allocations. Although, I
> am not sure if there's a way to specify to kmem_cache what pages to ask
> from page allocator.

That seems a bit obtuse rather than just respecting normal lock ordering
rules. No?
On Mon, 2025-09-29 at 09:22 -0700, Dave Hansen wrote:
> On 9/29/25 04:17, Kiryl Shutsemau wrote:
> > > Do you have any suggestions for a better approach?
> > >
> > > e.g., could the PAMT pages be allocated from a dedicated pool that
> > > ensures they reside in different 2MB ranges from guest private
> > > pages and TD control pages?

I guess the TDX module can get away with doing them in whatever random
order because they are try-locks. Perhaps we could take them in PFN order?
If need be, the TDX module could do the same.

> > It can work: allocate 2M a time for PAMT and piecemeal it to TDX
> > module as needed. But it means if 2M allocation is failed, TDX is not
> > functional.
> >
> > Maybe just use a dedicated kmem_cache for PAMT allocations. Although,
> > I am not sure if there's a way to specify to kmem_cache what pages to
> > ask from page allocator.
>
> That seems a bit obtuse rather than just respecting normal lock
> ordering rules. No?

It might serve the purpose of proving that scalability is possible. I was
thinking that if we had line of sight to improving it, we could go with
the "simple" solution and wait until there is a problem. Is that
reasonable?

But locking issues and state duplication between the TDX module and the
kernel are a recurring pattern. The similar KVM MMU issue was probably the
most contentious design issue we had in the TDX base enabling, together
with the TD configuration API. To me, DPAMT is just more proof that this
is not a rare conflict, but a fundamental problem.

If we do need to find a way to improve DPAMT scalability, I want to say we
should actually think about the generic case and try to solve that. But
that needs much more thought...
On Mon, 2025-09-29 at 09:58 -0700, Rick Edgecombe wrote:
> It might serve a purpose of proving that scalability is possible. I was
> thinking if we had line of sight to improving it we could go with the
> "simple" solution and wait until there is a problem. Is it reasonable?

We discussed this offline and the conclusion was to just keep a global
reader/writer lock as the back-pocket solution. So DPAMT SEAMCALLs take
the global lock for read, and if they get a BUSY code, they take the
global lock for write and retry. This is similar to what KVM does on a
per-VM scope, but global, and reduced by the refcount. But until there is
a measured issue, just keep the simpler global exclusive lock with the
refcount to start.

So unless anyone new chimes in, we can call this closed.
On 9/26/25 12:00, Edgecombe, Rick P wrote:
> So the problem is not fundamental to the approach I think. We just took
> a shortcut by ignoring the shared locks. For line-of-sight to a path to
> remove the global lock someday, I think we could make the 2MB granular
> locks be reader/writer to match the TDX module. Then around the
> SEAMCALLs that take these locks, we could take them on the kernel side
> in the right order for whichever SEAMCALL we are making.

Yeah, but what if the TDX module changes its locking scheme in 5 years or
10? What happens to 6.17.9999-stable?
On Fri, 2025-09-26 at 12:03 -0700, Dave Hansen wrote:
> Yeah, but what if the TDX module changes its locking scheme in 5 years
> or 10? What happens to 6.17.9999-stable?

Yea, we expected we were turning the locking behavior into ABI even in the
existing S-EPT management code. To change it would require a host opt-in.

We originally assured correctness via TDX module source code inspection,
but recently the TDX module folks pointed out that the TDX module ABI docs
actually list information about what locks are taken for each SEAMCALL.
Or, in its terms, which resources are held in which context. It is in the
tables labeled "Concurrency Restrictions". Lock ordering (like we would
need for the above) is not listed though.