From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
init_pamt_metadata() allocates PAMT refcounts for all physical memory up
to max_pfn. It might be suboptimal if the physical memory layout is
discontinuous and has large holes.
The refcount allocation is a vmalloc allocation. This is necessary to support a
large allocation size. The virtually contiguous property also makes it
easy to find a specific 2MB range’s refcount since it can simply be
indexed.
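For reference, the lookup boils down to indexing by which 2MB (PMD-sized)
region a PFN falls in, roughly:

	refcount = &pamt_refcounts[pfn >> (PMD_SHIFT - PAGE_SHIFT)];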
Since vmalloc mappings support remapping during normal kernel runtime,
switch to an approach that only populates refcount pages for the vmalloc
mapping when there is actually memory for that range. This means any holes
in the physical address space won’t use actual physical memory.
The validity of this memory optimization is based on a couple of assumptions:
1. Physical holes in the RAM layout are commonly large enough for it to be
worth it.
2. An alternative approach that looks up the refcounts via some more layered
data structure would overly complicate the lookups, or at least add more
complexity than managing the vmalloc mapping.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[Add feedback, update log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v4:
- Fix refcount allocation size calculation. (Kai, Binbin)
- Fix/improve comments. (Kai, Binbin)
- Simplify tdx_find_pamt_refcount() implementation and callers by making
it take a PFN and calculating it directly rather than going through a
PA intermediate.
- Check tdx_supports_dynamic_pamt() in alloc_pamt_refcount() to prevent
crash when TDX module does not support DPAMT. (Kai)
- Change refcounters->refcount in the log to be consistent
v3:
- Split from "x86/virt/tdx: Allocate reference counters for
PAMT memory" (Dave)
- Rename tdx_get_pamt_refcount()->tdx_find_pamt_refcount() (Dave)
- Drop duplicate pte_none() check (Dave)
- Align assignments in alloc_pamt_refcount() (Kai)
- Add comment in pamt_refcount_depopulate() to clarify teardown
logic (Dave)
- Drop __va(PFN_PHYS(pte_pfn(ptep_get()))) pile on for simpler method.
(Dave)
- Improve log
---
arch/x86/virt/vmx/tdx/tdx.c | 136 +++++++++++++++++++++++++++++++++---
1 file changed, 125 insertions(+), 11 deletions(-)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c28d4d11736c..edf9182ed86d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -194,30 +194,135 @@ int tdx_cpu_enable(void)
}
EXPORT_SYMBOL_GPL(tdx_cpu_enable);
-/*
- * Allocate PAMT reference counters for all physical memory.
- *
- * It consumes 2MiB for every 1TiB of physical memory.
- */
-static int init_pamt_metadata(void)
+/* Find PAMT refcount for a given PFN */
+static atomic_t *tdx_find_pamt_refcount(unsigned long pfn)
{
- size_t size = DIV_ROUND_UP(max_pfn, PTRS_PER_PTE) * sizeof(*pamt_refcounts);
+ /* Find which PMD a PFN is in. */
+ unsigned long index = pfn >> (PMD_SHIFT - PAGE_SHIFT);
- if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
- return 0;
+ return &pamt_refcounts[index];
+}
- pamt_refcounts = __vmalloc(size, GFP_KERNEL | __GFP_ZERO);
- if (!pamt_refcounts)
+/* Map a page into the PAMT refcount vmalloc region */
+static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
+{
+ struct page *page;
+ pte_t entry;
+
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
return -ENOMEM;
+ entry = mk_pte(page, PAGE_KERNEL);
+
+ spin_lock(&init_mm.page_table_lock);
+ /*
+ * PAMT refcount populations can overlap due to rounding of the
+ * start/end pfn. Make sure the PAMT range is only populated once.
+ */
+ if (pte_none(ptep_get(pte)))
+ set_pte_at(&init_mm, addr, pte, entry);
+ else
+ __free_page(page);
+ spin_unlock(&init_mm.page_table_lock);
+
return 0;
}
+/*
+ * Allocate PAMT reference counters for the given PFN range.
+ *
+ * It consumes 2MiB for every 1TiB of physical memory.
+ */
+static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long refcount_first, refcount_last;
+ unsigned long mapping_start, mapping_end;
+
+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+ return 0;
+
+ /*
+ * 'start_pfn' is inclusive and 'end_pfn' is exclusive. Find the
+ * range of refcounts the pfn range will need.
+ */
+ refcount_first = (unsigned long)tdx_find_pamt_refcount(start_pfn);
+ refcount_last = (unsigned long)tdx_find_pamt_refcount(end_pfn - 1);
+
+ /*
+ * Calculate the page aligned range that includes the refcounts. The
+ * teardown logic needs to handle potentially overlapping refcount
+ * mappings resulting from the alignments.
+ */
+ mapping_start = round_down(refcount_first, PAGE_SIZE);
+ mapping_end = round_up(refcount_last + sizeof(*pamt_refcounts), PAGE_SIZE);
+
+
+ return apply_to_page_range(&init_mm, mapping_start, mapping_end - mapping_start,
+ pamt_refcount_populate, NULL);
+}
+
+/*
+ * Reserve vmalloc range for PAMT reference counters. It covers all physical
+ * address space up to max_pfn. It is going to be populated from
+ * build_tdx_memlist() only for present memory that is available for TDX use.
+ *
+ * It reserves 2MiB of virtual address space for every 1TiB of physical memory.
+ */
+static int init_pamt_metadata(void)
+{
+ struct vm_struct *area;
+ size_t size;
+
+ if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
+ return 0;
+
+ size = DIV_ROUND_UP(max_pfn, PTRS_PER_PTE) * sizeof(*pamt_refcounts);
+
+ area = get_vm_area(size, VM_SPARSE);
+ if (!area)
+ return -ENOMEM;
+
+ pamt_refcounts = area->addr;
+ return 0;
+}
+
+/* Unmap a page from the PAMT refcount vmalloc region */
+static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr, void *data)
+{
+ struct page *page;
+ pte_t entry;
+
+ spin_lock(&init_mm.page_table_lock);
+
+ entry = ptep_get(pte);
+ /* refcount allocation is sparse, may not be populated */
+ if (!pte_none(entry)) {
+ pte_clear(&init_mm, addr, pte);
+ page = pte_page(entry);
+ __free_page(page);
+ }
+
+ spin_unlock(&init_mm.page_table_lock);
+
+ return 0;
+}
+
+/* Unmap all PAMT refcount pages and free vmalloc range */
static void free_pamt_metadata(void)
{
+ size_t size;
+
if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
return;
+ size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
+ size = round_up(size, PAGE_SIZE);
+
+ apply_to_existing_page_range(&init_mm,
+ (unsigned long)pamt_refcounts,
+ size, pamt_refcount_depopulate,
+ NULL);
vfree(pamt_refcounts);
pamt_refcounts = NULL;
}
@@ -288,10 +393,19 @@ static int build_tdx_memlist(struct list_head *tmb_list)
ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
if (ret)
goto err;
+
+ /* Allocate PAMT refcounts for the memblock */
+ ret = alloc_pamt_refcount(start_pfn, end_pfn);
+ if (ret)
+ goto err;
}
return 0;
err:
+ /*
+ * Only free TDX memory blocks here, PAMT refcount pages
+ * will be freed in the init_tdx_module() error path.
+ */
free_tdx_memlist(tmb_list);
return ret;
}
--
2.51.2
On 21.11.25 г. 2:51 ч., Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> init_pamt_metadata() allocates PAMT refcounts for all physical memory up
> to max_pfn. It might be suboptimal if the physical memory layout is
> discontinuous and has large holes.
>
> The refcount allocation vmalloc allocation. This is necessary to support a
nit: Something's odd with the first sentence, perhaps an "is a" is missing
before "vmalloc"?
> large allocation size. The virtually contiguous property also makes it
> easy to find a specific 2MB range’s refcount since it can simply be
> indexed.
>
> Since vmalloc mappings support remapping during normal kernel runtime,
> switch to an approach that only populates refcount pages for the vmalloc
> mapping when there is actually memory for that range. This means any holes
> in the physical address space won’t use actual physical memory.
>
> The validity of this memory optimization is based on a couple assumptions:
> 1. Physical holes in the ram layout are commonly large enough for it to be
> worth it.
> 2. An alternative approach that looks the refcounts via some more layered
> data structure wouldn’t overly complicate the lookups. Or at least
> more than the complexity of managing the vmalloc mapping.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> [Add feedback, update log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
<snip>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 136 +++++++++++++++++++++++++++++++++---
> 1 file changed, 125 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index c28d4d11736c..edf9182ed86d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -194,30 +194,135 @@ int tdx_cpu_enable(void)
> }
> EXPORT_SYMBOL_GPL(tdx_cpu_enable);
>
> -/*
> - * Allocate PAMT reference counters for all physical memory.
> - *
> - * It consumes 2MiB for every 1TiB of physical memory.
> - */
> -static int init_pamt_metadata(void)
> +/* Find PAMT refcount for a given physical address */
> +static atomic_t *tdx_find_pamt_refcount(unsigned long pfn)
> {
> - size_t size = DIV_ROUND_UP(max_pfn, PTRS_PER_PTE) * sizeof(*pamt_refcounts);
> + /* Find which PMD a PFN is in. */
> + unsigned long index = pfn >> (PMD_SHIFT - PAGE_SHIFT);
>
> - if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> - return 0;
> + return &pamt_refcounts[index];
> +}
>
> - pamt_refcounts = __vmalloc(size, GFP_KERNEL | __GFP_ZERO);
> - if (!pamt_refcounts)
> +/* Map a page into the PAMT refcount vmalloc region */
> +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> +{
> + struct page *page;
> + pte_t entry;
> +
> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + if (!page)
> return -ENOMEM;
>
> + entry = mk_pte(page, PAGE_KERNEL);
> +
> + spin_lock(&init_mm.page_table_lock);
> + /*
> + * PAMT refcount populations can overlap due to rounding of the
> + * start/end pfn. Make sure the PAMT range is only populated once.
> + */
> + if (pte_none(ptep_get(pte)))
> + set_pte_at(&init_mm, addr, pte, entry);
> + else
> + __free_page(page);
> + spin_unlock(&init_mm.page_table_lock);
nit: Wouldn't it be better to perform the pte_none() check before doing
the allocation thus avoiding needless allocations? I.e do the
alloc/mk_pte only after we are 100% sure we are going to use this entry.
> +
> return 0;
> }
<snip>
Kiryl, curious if you have any comments on the below...
On Wed, 2025-11-26 at 16:45 +0200, Nikolay Borisov wrote:
> > +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void
> > *data)
> > +{
> > + struct page *page;
> > + pte_t entry;
> > +
> > + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!page)
> > return -ENOMEM;
> >
> > + entry = mk_pte(page, PAGE_KERNEL);
> > +
> > + spin_lock(&init_mm.page_table_lock);
> > + /*
> > + * PAMT refcount populations can overlap due to rounding of the
> > + * start/end pfn. Make sure the PAMT range is only populated once.
> > + */
> > + if (pte_none(ptep_get(pte)))
> > + set_pte_at(&init_mm, addr, pte, entry);
> > + else
> > + __free_page(page);
> > + spin_unlock(&init_mm.page_table_lock);
>
> nit: Wouldn't it be better to perform the pte_none() check before doing
> the allocation thus avoiding needless allocations? I.e do the
> alloc/mk_pte only after we are 100% sure we are going to use this entry.
Yes, but I'm also wondering why it needs init_mm.page_table_lock at all. Here is
my reasoning for why it doesn't:
apply_to_page_range() takes init_mm.page_table_lock internally when it modifies
page tables in the address range (vmalloc). It needs to do this to avoid races
with other allocations that share the upper level page tables, which could be at
the ends of the area that TDX reserves.
But pamt_refcount_populate() is only operating on the PTEs for the address
range that TDX code already controls. Vmalloc should not free the PMD underneath
the PTE operation because there is an allocation in any page tables it covers.
So we can skip the lock and also do the pte_none() check before the page
allocation as Nikolay suggests.
Same for the depopulate path.
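For the populate side, roughly what I have in mind (untested sketch, not
final):

	/* Map a page into the PAMT refcount vmalloc region */
	static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
	{
		struct page *page;

		/*
		 * Populations can overlap due to rounding of the start/end
		 * pfn, so the PTE may already be set. Check before allocating
		 * to avoid a needless allocation. No init_mm.page_table_lock
		 * is needed: only TDX code operates on the PTEs of this
		 * reserved range, and vmalloc won't free the page tables
		 * under it while the area exists.
		 */
		if (!pte_none(ptep_get(pte)))
			return 0;

		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
		if (!page)
			return -ENOMEM;

		set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));

		return 0;
	}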
On Wed, Nov 26, 2025 at 08:47:07PM +0000, Edgecombe, Rick P wrote:
> Kiryl, curious if you have any comments on the below...
>
> On Wed, 2025-11-26 at 16:45 +0200, Nikolay Borisov wrote:
> > > +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void
> > > *data)
> > > +{
> > > + struct page *page;
> > > + pte_t entry;
> > > +
> > > + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > > + if (!page)
> > > return -ENOMEM;
> > >
> > > + entry = mk_pte(page, PAGE_KERNEL);
> > > +
> > > + spin_lock(&init_mm.page_table_lock);
> > > + /*
> > > + * PAMT refcount populations can overlap due to rounding of the
> > > + * start/end pfn. Make sure the PAMT range is only populated once.
> > > + */
> > > + if (pte_none(ptep_get(pte)))
> > > + set_pte_at(&init_mm, addr, pte, entry);
> > > + else
> > > + __free_page(page);
> > > + spin_unlock(&init_mm.page_table_lock);
> >
> > nit: Wouldn't it be better to perform the pte_none() check before doing
> > the allocation thus avoiding needless allocations? I.e do the
> > alloc/mk_pte only after we are 100% sure we are going to use this entry.
>
> Yes, but I'm also wondering why it needs init_mm.page_table_lock at all. Here is
> my reasoning for why it doesn't:
>
> apply_to_page_range() takes init_mm.page_table_lock internally when it modified
> page tables in the address range (vmalloc). It needs to do this to avoid races
> with other allocations that share the upper level page tables, which could be on
> the ends of area that TDX reserves.
>
> But pamt_refcount_populate() is only operating on the PTE's for the address
> range that TDX code already controls. Vmalloc should not free the PMD underneath
> the PTE operation because there is an allocation in any page tables it covers.
> So we can skip the lock and also do the pte_none() check before the page
> allocation as Nikolay suggests.
>
> Same for the depopulate path.
I cannot remember/find a good reason to keep the locking around.
--
Kiryl Shutsemau / Kirill A. Shutemov
On 26.11.25 г. 22:47 ч., Edgecombe, Rick P wrote:
> Kiryl, curious if you have any comments on the below...
>
> On Wed, 2025-11-26 at 16:45 +0200, Nikolay Borisov wrote:
>>> +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void
>>> *data)
>>> +{
>>> + struct page *page;
>>> + pte_t entry;
>>> +
>>> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>> + if (!page)
>>> return -ENOMEM;
>>>
>>> + entry = mk_pte(page, PAGE_KERNEL);
>>> +
>>> + spin_lock(&init_mm.page_table_lock);
>>> + /*
>>> + * PAMT refcount populations can overlap due to rounding of the
>>> + * start/end pfn. Make sure the PAMT range is only populated once.
>>> + */
>>> + if (pte_none(ptep_get(pte)))
>>> + set_pte_at(&init_mm, addr, pte, entry);
>>> + else
>>> + __free_page(page);
>>> + spin_unlock(&init_mm.page_table_lock);
>>
>> nit: Wouldn't it be better to perform the pte_none() check before doing
>> the allocation thus avoiding needless allocations? I.e do the
>> alloc/mk_pte only after we are 100% sure we are going to use this entry.
>
> Yes, but I'm also wondering why it needs init_mm.page_table_lock at all. Here is
> my reasoning for why it doesn't:
>
> apply_to_page_range() takes init_mm.page_table_lock internally when it modified
> page tables in the address range (vmalloc). It needs to do this to avoid races
> with other allocations that share the upper level page tables, which could be on
> the ends of area that TDX reserves.
>
> But pamt_refcount_populate() is only operating on the PTE's for the address
> range that TDX code already controls. Vmalloc should not free the PMD underneath
> the PTE operation because there is an allocation in any page tables it covers.
> So we can skip the lock and also do the pte_none() check before the page
> allocation as Nikolay suggests.
I agree with your analysis, but this needs to be described not only in
the commit message but also in a code comment, because you intentionally
omit locking: that particular pte (at that point) can only have a single
user, so no race conditions are possible.
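Something along these lines, as a rough sketch of the comment I have in mind:

	/*
	 * No locking is needed here: this PTE belongs to the VM_SPARSE area
	 * reserved for PAMT refcounts and is only populated/depopulated by
	 * TDX init/teardown code, so there is a single user and no race.
	 */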
>
> Same for the depopulate path.
On 11/21/2025 8:51 AM, Rick Edgecombe wrote:
[...]
> +
> +/* Unmap a page from the PAMT refcount vmalloc region */
> +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr, void *data)
> +{
> + struct page *page;
> + pte_t entry;
> +
> + spin_lock(&init_mm.page_table_lock);
> +
> + entry = ptep_get(pte);
> + /* refcount allocation is sparse, may not be populated */
Not sure this comment about "sparse" is accurate since this function is called via
apply_to_existing_page_range().
And is the check for not-present just a sanity check?
> + if (!pte_none(entry)) {
> + pte_clear(&init_mm, addr, pte);
> + page = pte_page(entry);
> + __free_page(page);
> + }
> +
> + spin_unlock(&init_mm.page_table_lock);
> +
> + return 0;
> +}
> +
> +/* Unmap all PAMT refcount pages and free vmalloc range */
> static void free_pamt_metadata(void)
> {
> + size_t size;
> +
> if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> return;
>
> + size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> + size = round_up(size, PAGE_SIZE);
> +
> + apply_to_existing_page_range(&init_mm,
> + (unsigned long)pamt_refcounts,
> + size, pamt_refcount_depopulate,
> + NULL);
> vfree(pamt_refcounts);
> pamt_refcounts = NULL;
> }
> @@ -288,10 +393,19 @@ static int build_tdx_memlist(struct list_head *tmb_list)
> ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
> if (ret)
> goto err;
> +
> + /* Allocated PAMT refcountes for the memblock */
> + ret = alloc_pamt_refcount(start_pfn, end_pfn);
> + if (ret)
> + goto err;
> }
>
> return 0;
> err:
> + /*
> + * Only free TDX memory blocks here, PAMT refcount pages
> + * will be freed in the init_tdx_module() error path.
> + */
> free_tdx_memlist(tmb_list);
> return ret;
> }
On Tue, 2025-11-25 at 11:15 +0800, Binbin Wu wrote:
> On 11/21/2025 8:51 AM, Rick Edgecombe wrote:
> [...]
> > +
> > +/* Unmap a page from the PAMT refcount vmalloc region */
> > +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr, void *data)
> > +{
> > + struct page *page;
> > + pte_t entry;
> > +
> > + spin_lock(&init_mm.page_table_lock);
> > +
> > + entry = ptep_get(pte);
> > + /* refcount allocation is sparse, may not be populated */
>
> Not sure this comment about "sparse" is accurate since this function is called via
> apply_to_existing_page_range().
>
> And the check for not present just for sanity check?
Yes, I don't see what that comment is referring to. But we do need it, because
hypothetically the refcount mapping could have failed halfway. So we will have
pte_none()s for the ranges that didn't get populated. I'll use:
/* Refcount mapping could have failed part way, handle aborted mappings. */
On Wed, Nov 26, 2025 at 08:47:19PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2025-11-25 at 11:15 +0800, Binbin Wu wrote:
> > On 11/21/2025 8:51 AM, Rick Edgecombe wrote:
> > [...]
> > > +
> > > +/* Unmap a page from the PAMT refcount vmalloc region */
> > > +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr, void *data)
> > > +{
> > > + struct page *page;
> > > + pte_t entry;
> > > +
> > > + spin_lock(&init_mm.page_table_lock);
> > > +
> > > + entry = ptep_get(pte);
> > > + /* refcount allocation is sparse, may not be populated */
> >
> > Not sure this comment about "sparse" is accurate since this function is called via
> > apply_to_existing_page_range().
> >
> > And the check for not present just for sanity check?
>
> Yes, I don't see what that comment is referring to. But we do need it, because
> hypothetically the refcount mapping could have failed halfway. So we will have
> pte_none()s for the ranges that didn't get populated. I'll use:
>
> /* Refcount mapping could have failed part way, handle aborted mappings. */
It is possible that we can have holes in physical address space between
0 and max_pfn. You need the check even outside of the "failed halfway"
scenario.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, 2025-11-27 at 15:57 +0000, Kiryl Shutsemau wrote:
> > Yes, I don't see what that comment is referring to. But we do need it,
> > because hypothetically the refcount mapping could have failed halfway. So we
> > will have pte_none()s for the ranges that didn't get populated. I'll use:
> >
> > /* Refcount mapping could have failed part way, handle aborted mappings. */
>
> It is possible that we can have holes in physical address space between
> 0 and max_pfn. You need the check even outside of "failed halfway"
> scenario.

Err, right. Was thinking of for_each_mem_pfn_range() on the populate side.
pamt_refcount_depopulate() is just called with the whole refcount virtual
address range. I'll add both reasons to the comment.
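Maybe something along these lines (sketch):

	/*
	 * The refcount area is only populated for present memory, so holes in
	 * the physical address space below max_pfn are not populated. A
	 * population attempt could also have failed part way. Either way,
	 * expect unpopulated PTEs here.
	 */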