[PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL

Kiryl Shutsemau posted 1 patch 2 weeks, 5 days ago
There is a newer version of this series
mm/khugepaged.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
[PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Kiryl Shutsemau 2 weeks, 5 days ago
From: Kiryl Shutsemau <kas@kernel.org>

MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
PMD page table is installed or not.

Consider following example:

	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	err = madvise(p, 2UL << 20, MADV_COLLAPSE);

fd is a populated tmpfs file.

The result depends on the address that the kernel returns on mmap().
If it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.

This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
a page table is missing, which causes collapse_pte_mapped_thp() to fail.

SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/khugepaged.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b486c1d19b2d..9e76a4f46df9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1488,6 +1488,28 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return SCAN_SUCCEED;
 }
 
+static int install_huge_pmd(struct vm_area_struct *vma, unsigned long haddr,
+			    pmd_t *pmd, struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	pgd = pgd_offset(mm, haddr);
+	p4d = p4d_alloc(mm, pgd, haddr);
+	if (!p4d)
+		return SCAN_FAIL;
+	pud = pud_alloc(mm, p4d, haddr);
+	if (!pud)
+		return SCAN_FAIL;
+	pmd = pmd_alloc(mm, pud, haddr);
+	if (!pmd)
+		return SCAN_FAIL;
+
+	return set_huge_pmd(vma, haddr, pmd, folio, &folio->page);
+}
+
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1556,6 +1578,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	switch (result) {
 	case SCAN_SUCCEED:
 		break;
+	case SCAN_PMD_NULL:
 	case SCAN_PMD_NONE:
 		/*
 		 * All pte entries have been removed and pmd cleared.
@@ -1700,7 +1723,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 maybe_install_pmd:
 	/* step 5: install pmd entry */
 	result = install_pmd
-			? set_huge_pmd(vma, haddr, pmd, folio, &folio->page)
+			? install_huge_pmd(vma, haddr, pmd, folio)
 			: SCAN_SUCCEED;
 	goto drop_folio;
 abort:
-- 
2.50.1
Re: [PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Baolin Wang 2 weeks, 3 days ago

On 2025/9/13 00:58, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
> 
> Consider following example:
> 
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> 
> fd is a populated tmpfs file.
> 
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
> 
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> 
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---

Make sense to me. Thanks.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Re: [PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Dev Jain 2 weeks, 4 days ago
On 12/09/25 10:28 pm, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.

Thanks.

Besides this patch, the label maybe_install_pmd is misleading -
SCAN_PMD_NONE means that the pmd table exists, just that the pmd
entry is none, so the pmd is already installed. Along with this,
the argument bool install_pmd should likewise be install_huge_pmd.

>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>   mm/khugepaged.c | 25 ++++++++++++++++++++++++-
>   1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b486c1d19b2d..9e76a4f46df9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1488,6 +1488,28 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   	return SCAN_SUCCEED;
>   }
>   
> +static int install_huge_pmd(struct vm_area_struct *vma, unsigned long haddr,
> +			    pmd_t *pmd, struct folio *folio)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +
> +	pgd = pgd_offset(mm, haddr);
> +	p4d = p4d_alloc(mm, pgd, haddr);
> +	if (!p4d)
> +		return SCAN_FAIL;
> +	pud = pud_alloc(mm, p4d, haddr);
> +	if (!pud)
> +		return SCAN_FAIL;
> +	pmd = pmd_alloc(mm, pud, haddr);
> +	if (!pmd)
> +		return SCAN_FAIL;
> +
> +	return set_huge_pmd(vma, haddr, pmd, folio, &folio->page);
> +}
> +

For the SCAN_PMD_NONE case, we are unconditionally traversing the pagetables
now which is not needed. How about, in set_huge_pmd(), we pass a boolean install_pmd,
and at the start of the function, call install_pmd() which will do the traversal
and the pmd_alloc()? That will also make it crystal clear that in the SCAN_PMD_NULL
case, we are first installing the PMD table and then setting it to huge. Right now
the distinction between the two cases is not clear.
  

>   /**
>    * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
>    * address haddr.
> @@ -1556,6 +1578,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   	switch (result) {
>   	case SCAN_SUCCEED:
>   		break;
> +	case SCAN_PMD_NULL:
>   	case SCAN_PMD_NONE:
>   		/*
>   		 * All pte entries have been removed and pmd cleared.
> @@ -1700,7 +1723,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   maybe_install_pmd:
>   	/* step 5: install pmd entry */
>   	result = install_pmd
> -			? set_huge_pmd(vma, haddr, pmd, folio, &folio->page)
> +			? install_huge_pmd(vma, haddr, pmd, folio)
>   			: SCAN_SUCCEED;
>   	goto drop_folio;
>   abort:
Re: [PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Kiryl Shutsemau 2 weeks, 3 days ago
On Sun, Sep 14, 2025 at 12:56:13PM +0530, Dev Jain wrote:
> 
> On 12/09/25 10:28 pm, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> > 
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> > 
> > Consider following example:
> > 
> > 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> > 		 MAP_SHARED, fd, 0);
> > 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> > 
> > fd is a populated tmpfs file.
> > 
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> > 
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> > 
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
> 
> Thanks.
> 
> Besides this patch, the label maybe_install_pmd is misleading -
> SCAN_PMD_NONE means that the pmd table exists, just that the pmd
> entry is none, so the pmd is already installed.

That's never ending confusion between PTE/PMD/P?D entry and table.
Addressing it is out of scope of the patch :P

> Along with this,
> the argument bool install_pmd should likewise be install_huge_pmd.

Well, if you rename install_pmd to install_huge_pmd it will overshadow
the install_huge_pmd() function. And the label name is not a problem in
my view.

> 
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > ---
> >   mm/khugepaged.c | 25 ++++++++++++++++++++++++-
> >   1 file changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b486c1d19b2d..9e76a4f46df9 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1488,6 +1488,28 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> >   	return SCAN_SUCCEED;
> >   }
> > +static int install_huge_pmd(struct vm_area_struct *vma, unsigned long haddr,
> > +			    pmd_t *pmd, struct folio *folio)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	pgd_t *pgd;
> > +	p4d_t *p4d;
> > +	pud_t *pud;
> > +
> > +	pgd = pgd_offset(mm, haddr);
> > +	p4d = p4d_alloc(mm, pgd, haddr);
> > +	if (!p4d)
> > +		return SCAN_FAIL;
> > +	pud = pud_alloc(mm, p4d, haddr);
> > +	if (!pud)
> > +		return SCAN_FAIL;
> > +	pmd = pmd_alloc(mm, pud, haddr);
> > +	if (!pmd)
> > +		return SCAN_FAIL;
> > +
> > +	return set_huge_pmd(vma, haddr, pmd, folio, &folio->page);
> > +}
> > +
> 
> For the SCAN_PMD_NONE case, we are unconditionally traversing the pagetables
> now which is not needed. How about, in set_huge_pmd(), we pass a boolean install_pmd,
> and at the start of the function, call install_pmd() which will do the traversal
> and the pmd_alloc()? That will also make it crystal clear that in the SCAN_PMD_NULL
> case, we are first installing the PMD table and then setting it to huge. Right now
> the distinction between the two cases is not clear.

I just realized that my install_huge_pmd() doesn't use pmd that is pass
in. And looking at code again, I think it is better to integrate the
page table allocation directly into set_huge_pmd().

See the patch below. I will submit it as v2, if there's no objections.

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b486c1d19b2d..986718599355 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct folio *folio, struct page *page)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = addr,
 		.flags = 0,
-		.pmd = pmdp,
 	};
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
 
 	mmap_assert_locked(vma->vm_mm);
 
+	if (!pmdp) {
+		pgdp = pgd_offset(mm, addr);
+		p4dp = p4d_alloc(mm, pgdp, addr);
+		if (!p4dp)
+			return SCAN_FAIL;
+		pudp = pud_alloc(mm, p4dp, addr);
+		if (!pudp)
+			return SCAN_FAIL;
+		pmdp = pmd_alloc(mm, pudp, addr);
+		if (!pmdp)
+			return SCAN_FAIL;
+	}
+
+	vmf.pmd = pmdp;
 	if (do_set_pmd(&vmf, folio, page))
 		return SCAN_FAIL;
 
@@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	switch (result) {
 	case SCAN_SUCCEED:
 		break;
+	case SCAN_PMD_NULL:
 	case SCAN_PMD_NONE:
 		/*
 		 * All pte entries have been removed and pmd cleared.
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Lorenzo Stoakes 2 weeks, 5 days ago
On Fri, Sep 12, 2025 at 05:58:11PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

LGTM, nice!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b486c1d19b2d..9e76a4f46df9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1488,6 +1488,28 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	return SCAN_SUCCEED;
>  }
>
> +static int install_huge_pmd(struct vm_area_struct *vma, unsigned long haddr,
> +			    pmd_t *pmd, struct folio *folio)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +
> +	pgd = pgd_offset(mm, haddr);
> +	p4d = p4d_alloc(mm, pgd, haddr);
> +	if (!p4d)
> +		return SCAN_FAIL;
> +	pud = pud_alloc(mm, p4d, haddr);
> +	if (!pud)
> +		return SCAN_FAIL;
> +	pmd = pmd_alloc(mm, pud, haddr);
> +	if (!pmd)
> +		return SCAN_FAIL;
> +
> +	return set_huge_pmd(vma, haddr, pmd, folio, &folio->page);
> +}
> +
>  /**
>   * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
>   * address haddr.
> @@ -1556,6 +1578,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	switch (result) {
>  	case SCAN_SUCCEED:
>  		break;
> +	case SCAN_PMD_NULL:
>  	case SCAN_PMD_NONE:
>  		/*
>  		 * All pte entries have been removed and pmd cleared.
> @@ -1700,7 +1723,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  maybe_install_pmd:
>  	/* step 5: install pmd entry */
>  	result = install_pmd
> -			? set_huge_pmd(vma, haddr, pmd, folio, &folio->page)
> +			? install_huge_pmd(vma, haddr, pmd, folio)
>  			: SCAN_SUCCEED;
>  	goto drop_folio;
>  abort:
> --
> 2.50.1
Re: [PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by David Hildenbrand 2 weeks, 5 days ago
On 12.09.25 18:58, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
> 
> Consider following example:
> 
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> 
> fd is a populated tmpfs file.
> 
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
> 
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> 
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---

Makes sense to me.

Is this something we want a Fixes: or even Cc: stable for?

I assume it doesn't really happen frequently, but could also happen 
after MADV_DONTNEED'ing the full range with pt reclaim I think after 
having faulted in some PTEs, for example.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb
Re: [PATCH] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Kiryl Shutsemau 2 weeks, 5 days ago
On Fri, Sep 12, 2025 at 07:13:48PM +0200, David Hildenbrand wrote:
> On 12.09.25 18:58, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> > 
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> > 
> > Consider following example:
> > 
> > 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> > 		 MAP_SHARED, fd, 0);
> > 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> > 
> > fd is a populated tmpfs file.
> > 
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> > 
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> > 
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > ---
> 
> Makes sense to me.
> 
> Is this something we want a Fixes: or even Cc: stable for?

I am not sure if it is stable@ matter.

I believe it is there from the start:

Fixes: 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")

-- 
  Kiryl Shutsemau / Kirill A. Shutemov