[PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()

Lorenzo Stoakes (Oracle) posted 13 patches 2 weeks ago
[PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Lorenzo Stoakes (Oracle) 2 weeks ago
Rather than thread has_deposited through zap_huge_pmd(), make things
clearer by adding has_deposited_pgtable() with comments describing why in
each case.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 mm/huge_memory.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5831966391bd..610a6184e92c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2326,8 +2326,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
-		pmd_t pmdval, struct folio *folio, bool is_present,
-		bool *has_deposit)
+		pmd_t pmdval, struct folio *folio, bool is_present)
 {
 	const bool is_device_private = folio_is_device_private(folio);
 
@@ -2336,7 +2335,6 @@ static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
 		folio_remove_rmap_pmd(folio, &folio->page, vma);
 
 	if (folio_test_anon(folio)) {
-		*has_deposit = true;
 		add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	} else {
 		add_mm_counter(mm, mm_counter_file(folio),
@@ -2363,6 +2361,27 @@ static struct folio *normal_or_softleaf_folio_pmd(struct vm_area_struct *vma,
 	return pmd_to_softleaf_folio(pmdval);
 }
 
+static bool has_deposited_pgtable(struct vm_area_struct *vma, pmd_t pmdval,
+		struct folio *folio)
+{
+	/* Some architectures require unconditional depositing. */
+	if (arch_needs_pgtable_deposit())
+		return true;
+
+	/*
+	 * Huge zero always deposited except for DAX which handles itself, see
+	 * set_huge_zero_folio().
+	 */
+	if (is_huge_zero_pmd(pmdval))
+		return !vma_is_dax(vma);
+
+	/*
+	 * Otherwise, only anonymous folios are deposited, see
+	 * __do_huge_pmd_anonymous_page().
+	 */
+	return folio && folio_test_anon(folio);
+}
+
 /**
  * zap_huge_pmd - Zap a huge THP which is of PMD size.
  * @tlb: The MMU gather TLB state associated with the operation.
@@ -2375,7 +2394,6 @@ static struct folio *normal_or_softleaf_folio_pmd(struct vm_area_struct *vma,
 bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
-	bool has_deposit = arch_needs_pgtable_deposit();
 	struct mm_struct *mm = tlb->mm;
 	struct folio *folio = NULL;
 	bool is_present = false;
@@ -2401,12 +2419,9 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	is_present = pmd_present(orig_pmd);
 	folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
 	if (folio)
-		zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present,
-				   &has_deposit);
-	else if (is_huge_zero_pmd(orig_pmd))
-		has_deposit = !vma_is_dax(vma);
+		zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present);
 
-	if (has_deposit)
+	if (has_deposited_pgtable(vma, orig_pmd, folio))
 		zap_deposited_table(mm, pmd);
 
 	spin_unlock(ptl);
-- 
2.53.0
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Yin Tirui 1 day, 23 hours ago

On 3/21/26 02:07, Lorenzo Stoakes (Oracle) wrote:

> +static bool has_deposited_pgtable(struct vm_area_struct *vma, pmd_t pmdval,
> +		struct folio *folio)
> +{
> +	/* Some architectures require unconditional depositing. */
> +	if (arch_needs_pgtable_deposit())
> +		return true;
> +
> +	/*
> +	 * Huge zero always deposited except for DAX which handles itself, see
> +	 * set_huge_zero_folio().
> +	 */
> +	if (is_huge_zero_pmd(pmdval))
> +		return !vma_is_dax(vma);
> +
> +	/*
> +	 * Otherwise, only anonymous folios are deposited, see
> +	 * __do_huge_pmd_anonymous_page().
> +	 */
> +	return folio && folio_test_anon(folio);
> +}

Hi Lorenzo,

I just wanted to mention a potential intersection with my upcoming v4 of
the "mm: add huge pfnmap support for remap_pfn_range()" series [1].

To safely support PMD splitting on partial unmaps, my series makes
VM_PFNMAP huge pages allocate and deposit page tables. However, they
neither have an underlying struct folio nor are anonymous.

Given the has_deposited_pgtable() helper, do you have any suggestions on
what would be the cleanest way to integrate the VM_PFNMAP condition into it?

[1]
https://lore.kernel.org/linux-mm/20260228070906.1418911-5-yintirui@huawei.com/

-- 
Yin Tirui
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Lorenzo Stoakes (Oracle) 1 day, 20 hours ago
On Thu, Apr 02, 2026 at 11:19:07AM +0800, Yin Tirui wrote:
>
>
> On 3/21/26 02:07, Lorenzo Stoakes (Oracle) wrote:
>
> > +static bool has_deposited_pgtable(struct vm_area_struct *vma, pmd_t pmdval,
> > +		struct folio *folio)
> > +{
> > +	/* Some architectures require unconditional depositing. */
> > +	if (arch_needs_pgtable_deposit())
> > +		return true;
> > +
> > +	/*
> > +	 * Huge zero always deposited except for DAX which handles itself, see
> > +	 * set_huge_zero_folio().
> > +	 */
> > +	if (is_huge_zero_pmd(pmdval))
> > +		return !vma_is_dax(vma);
> > +
> > +	/*
> > +	 * Otherwise, only anonymous folios are deposited, see
> > +	 * __do_huge_pmd_anonymous_page().
> > +	 */
> > +	return folio && folio_test_anon(folio);
> > +}
>
> Hi Lorenzo,
>
> I just wanted to mention a potential intersection with my upcoming v4 of
> the "mm: add huge pfnmap support for remap_pfn_range()" series [1].
>
> To safely support PMD splitting on partial unmaps, my series makes
> VM_PFNMAP huge pages allocate and deposit page tables. However, they
> neither have an underlying struct folio nor are anonymous.
>
> Given the has_deposited_pgtable() helper, do you have any suggestions on
> what would be the cleanest way to integrate the VM_PFNMAP condition into it?


I mean you would have needed to handle this case in any event, since this change
is strictly an equivalent reworking of zap_huge_pmd().

But it seems that doing so has clarified the requirements somewhat here :)

I haven't had a look at that series yet (please cc this email if you weren't
already, I do filter a lot of stuff due to how much mail I get daily)

So if this is a PMD leaf entry it will be present and PFN map, so I'd have
thought simply adding:

	/* Huge PFN map must deposit, as cannot refault. */
	if (vma_test(vma, VMA_PFNMAP_BIT))
		return true;

Would suffice?

By the way, I am wondering if the prot bits are correctly preserved on page
table deposit, as this is key for pfn map (e.g. if the range is uncached, for
instance). That's something to check and ensure is correct.

I _suspect_ they will be, as we have pretty well established mechanisms for that
(propagate vma->vm_page_prot etc.) but definitely worth making sure.

>
> [1]
> https://lore.kernel.org/linux-mm/20260228070906.1418911-5-yintirui@huawei.com/
>
> --
> Yin Tirui
>

Cheers, Lorenzo
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Yin Tirui 1 day, 19 hours ago

On 4/2/26 14:46, Lorenzo Stoakes (Oracle) wrote:
> 
> I mean you would have needed to handle this case in any event, since this change
> is strictly an equivalent reworking of zap_huge_pmd().
> 
> But it seems that doing so has clarified the requirements somewhat here :)
> 
> I haven't had a look at that series yet (please cc this email if you weren't
> already, I do filter a lot of stuff due to how much mail I get daily)

Hi Lorenzo,

Thanks for the quick reply. I will definitely CC you on the v4 series.

> 
> So if this is a PMD leaf entry it will be present and PFN map, so I'd have
> thought simply adding:
> 
> 	/* Huge PFN map must deposit, as cannot refault. */
> 	if (vma_test(vma, VMA_PFNMAP_BIT))
> 		return true;
> 
> Would suffice?

Here is the dilemma:

Currently, VFIO uses vmf_insert_pfn_pmd() to create huge pfnmaps on page
faults. This sets VM_PFNMAP in vfio_pci_core_mmap(), but it does not
deposit a pgtable (unless arch_needs_pgtable_deposit() is true).

To resolve this,

Option A: Force VFIO (vmf_insert_pfn_pmd) to also deposit pgtables. This
unifies the VM_PFNMAP lifecycle. However, since VFIO can refault,
depositing pgtables here incurs unnecessary memory overhead.

Option B: Introduce a new VMA flag set during remap_pfn_range(), which
we can explicitly check in has_deposited_pgtable().

Option C: Check vma->vm_ops->fault (and huge_fault). We would only
deposit pgtables for mappings without fault handlers. However, this is
fragile because a driver might still register a .fault() handler that
simply returns VM_FAULT_SIGBUS.

Do you have a preference among these, or perhaps another idea?

> 
> By the way, I am wondering if the prot bits are correctly preserved on page
> table deposit, as this is key for pfn map (e.g. if the range is uncached, for
> instance). That's something to check and ensure is correct.
> 
> I _suspect_ they will be, as we have pretty well established mechanisms for that
> (propagate vma->vm_page_prot etc.) but definitely worth making sure.
> 

Yes, they are correctly preserved!

During a PMD split in __split_huge_pmd_locked(), we populate the
deposited pgtable like this:

    entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd));
    set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);

The newly refactored pmd_pgprot() correctly extracts the exact
protection bits (including crucial cache modes like UC/WC for device
memory) from the huge PMD, strips the hardware-specific huge bit, and
returns a pure PTE-level pgprot_t.

>>
>> [1]
>> https://lore.kernel.org/linux-mm/20260228070906.1418911-5-yintirui@huawei.com/
>>
>> --
>> Yin Tirui
>>
> 
> Cheers, Lorenzo

-- 
Yin Tirui
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Lorenzo Stoakes (Oracle) 1 week, 4 days ago
Hi Andrew,

Could you apply the below fix-patch to resolve an issue with us performing
folio_put() on a folio before checking it again to see if a table was
deposited, as per Sashiko.

This patch resolves the issue by storing whether or not this is the case in
a has_deposit local variable (as used previously) before invoking
zap_huge_pmd_folio(), then using this boolean to determine whether or not
to zap any deposited table.

Thanks, Lorenzo

----8<----
From 009f8abba834b49f8285b03a680dbd04d953a528 Mon Sep 17 00:00:00 2001
From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Date: Mon, 23 Mar 2026 11:42:01 +0000
Subject: [PATCH] fix

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 mm/huge_memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 610a6184e92c..4585465eda0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2397,6 +2397,7 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct mm_struct *mm = tlb->mm;
 	struct folio *folio = NULL;
 	bool is_present = false;
+	bool has_deposit;
 	spinlock_t *ptl;
 	pmd_t orig_pmd;

@@ -2420,8 +2421,8 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
 	if (folio)
 		zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present);
-
-	if (has_deposited_pgtable(vma, orig_pmd, folio))
+	has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
+	if (has_deposit)
 		zap_deposited_table(mm, pmd);

 	spin_unlock(ptl);
--
2.53.0
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Lorenzo Stoakes (Oracle) 1 week, 4 days ago
On Mon, Mar 23, 2026 at 11:45:23AM +0000, Lorenzo Stoakes (Oracle) wrote:
> Hi Andrew,
>
> Could you apply the below fix-patch to resolve an issue with us performing
> folio_put() on a folio before checking it again to see if a table was
> deposited, as per Sashiko.
>
> This patch resolves the issue by storing whether or not this is the case in
> a has_deposit local variable (as used previously) before invoking
> zap_huge_pmd_folio(), then using this boolean to determine whether or not
> to zap any deposited table.

Oops this is wrong...!

Please apply the below instead :)

Thanks, Lorenzo

----8<----
From e6d58747d00dd954c605201e97f8b769b2ba8cf8 Mon Sep 17 00:00:00 2001
From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Date: Mon, 23 Mar 2026 11:42:01 +0000
Subject: [PATCH] fix

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 mm/huge_memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 610a6184e92c..b2a6060b3c20 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2397,6 +2397,7 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct mm_struct *mm = tlb->mm;
 	struct folio *folio = NULL;
 	bool is_present = false;
+	bool has_deposit;
 	spinlock_t *ptl;
 	pmd_t orig_pmd;

@@ -2418,10 +2419,10 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

 	is_present = pmd_present(orig_pmd);
 	folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
+	has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
 	if (folio)
 		zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present);
-
-	if (has_deposited_pgtable(vma, orig_pmd, folio))
+	if (has_deposit)
 		zap_deposited_table(mm, pmd);

 	spin_unlock(ptl);
--
2.53.0
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Suren Baghdasaryan 6 days, 7 hours ago
On Mon, Mar 23, 2026 at 5:25 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Mon, Mar 23, 2026 at 11:45:23AM +0000, Lorenzo Stoakes (Oracle) wrote:
> > Hi Andrew,
> >
> > Could you apply the below fix-patch to resolve an issue with us performing
> > folio_put() on a folio before checking it again to see if a table was
> > deposited, as per Sashiko.
> >
> > This patch resolves the issue by storing whether or not this is the case in
> > a has_deposit local variable (as used previously) before invoking
> > zap_huge_pmd_folio(), then using this boolean to determine whether or not
> > to zap any deposited table.
>
> Oops this is wrong...!
>
> Please apply the below instead :)
>
> Thanks, Lorenzo

With the fix LGTM. Overall very nice cleanup making the code much more
readable. Thanks Lorenzo!

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> ----8<----
> From e6d58747d00dd954c605201e97f8b769b2ba8cf8 Mon Sep 17 00:00:00 2001
> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> Date: Mon, 23 Mar 2026 11:42:01 +0000
> Subject: [PATCH] fix
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
>  mm/huge_memory.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 610a6184e92c..b2a6060b3c20 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2397,6 +2397,7 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>         struct mm_struct *mm = tlb->mm;
>         struct folio *folio = NULL;
>         bool is_present = false;
> +       bool has_deposit;
>         spinlock_t *ptl;
>         pmd_t orig_pmd;
>
> @@ -2418,10 +2419,10 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>
>         is_present = pmd_present(orig_pmd);
>         folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
> +       has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
>         if (folio)
>                 zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present);
> -
> -       if (has_deposited_pgtable(vma, orig_pmd, folio))
> +       if (has_deposit)
>                 zap_deposited_table(mm, pmd);
>
>         spin_unlock(ptl);
> --
> 2.53.0
Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
Posted by Lorenzo Stoakes (Oracle) 4 days, 17 hours ago
On Sat, Mar 28, 2026 at 12:54:38PM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 23, 2026 at 5:25 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:45:23AM +0000, Lorenzo Stoakes (Oracle) wrote:
> > > Hi Andrew,
> > >
> > > Could you apply the below fix-patch to resolve an issue with us performing
> > > folio_put() on a folio before checking it again to see if a table was
> > > deposited, as per Sashiko.
> > >
> > > This patch resolves the issue by storing whether or not this is the case in
> > > a has_deposit local variable (as used previously) before invoking
> > > zap_huge_pmd_folio(), then using this boolean to determine whether or not
> > > to zap any deposited table.
> >
> > Oops this is wrong...!
> >
> > Please apply the below instead :)
> >
> > Thanks, Lorenzo
>
> With the fix LGTM. Overall very nice cleanup making the code much more
> readable. Thanks Lorenzo!

Thanks :)

>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks for the review!

Cheers, Lorenzo

>
> >
> > ----8<----
> > From e6d58747d00dd954c605201e97f8b769b2ba8cf8 Mon Sep 17 00:00:00 2001
> > From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> > Date: Mon, 23 Mar 2026 11:42:01 +0000
> > Subject: [PATCH] fix
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  mm/huge_memory.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 610a6184e92c..b2a6060b3c20 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2397,6 +2397,7 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >         struct mm_struct *mm = tlb->mm;
> >         struct folio *folio = NULL;
> >         bool is_present = false;
> > +       bool has_deposit;
> >         spinlock_t *ptl;
> >         pmd_t orig_pmd;
> >
> > @@ -2418,10 +2419,10 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >
> >         is_present = pmd_present(orig_pmd);
> >         folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
> > +       has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
> >         if (folio)
> >                 zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present);
> > -
> > -       if (has_deposited_pgtable(vma, orig_pmd, folio))
> > +       if (has_deposit)
> >                 zap_deposited_table(mm, pmd);
> >
> >         spin_unlock(ptl);
> > --
> > 2.53.0