[PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL

Kiryl Shutsemau posted 1 patch 2 weeks, 3 days ago
mm/khugepaged.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
[PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Kiryl Shutsemau 2 weeks, 3 days ago
From: Kiryl Shutsemau <kas@kernel.org>

MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
PMD page table is installed or not.

Consider following example:

	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	err = madvise(p, 2UL << 20, MADV_COLLAPSE);

fd is a populated tmpfs file.

The result depends on the address that the kernel returns on mmap().
If it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.

This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
a page table is missing, which causes collapse_pte_mapped_thp() to fail.

SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---

v2:
 - Modify set_huge_pmd() instead of introducing install_huge_pmd();

---
 mm/khugepaged.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b486c1d19b2d..986718599355 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct folio *folio, struct page *page)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = addr,
 		.flags = 0,
-		.pmd = pmdp,
 	};
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
 
 	mmap_assert_locked(vma->vm_mm);
 
+	if (!pmdp) {
+		pgdp = pgd_offset(mm, addr);
+		p4dp = p4d_alloc(mm, pgdp, addr);
+		if (!p4dp)
+			return SCAN_FAIL;
+		pudp = pud_alloc(mm, p4dp, addr);
+		if (!pudp)
+			return SCAN_FAIL;
+		pmdp = pmd_alloc(mm, pudp, addr);
+		if (!pmdp)
+			return SCAN_FAIL;
+	}
+
+	vmf.pmd = pmdp;
 	if (do_set_pmd(&vmf, folio, page))
 		return SCAN_FAIL;
 
@@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	switch (result) {
 	case SCAN_SUCCEED:
 		break;
+	case SCAN_PMD_NULL:
 	case SCAN_PMD_NONE:
 		/*
 		 * All pte entries have been removed and pmd cleared.
-- 
2.50.1
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Baolin Wang 1 week, 6 days ago

On 2025/9/15 21:52, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
> 
> Consider following example:
> 
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> 
> fd is a populated tmpfs file.
> 
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
> 
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> 
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Lorenzo Stoakes 2 weeks, 2 days ago
On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

There was a v1 with tags, you've not propagated any of them? Did you feel
the change was enough to remove them?

Anyway, LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>
> v2:
>  - Modify set_huge_pmd() instead of introducing install_huge_pmd();
>
> ---
>  mm/khugepaged.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b486c1d19b2d..986718599355 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>  static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>  			pmd_t *pmdp, struct folio *folio, struct page *page)
>  {
> +	struct mm_struct *mm = vma->vm_mm;
>  	struct vm_fault vmf = {
>  		.vma = vma,
>  		.address = addr,
>  		.flags = 0,
> -		.pmd = pmdp,
>  	};
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp;
>
>  	mmap_assert_locked(vma->vm_mm);

NIT: you have mm as a local var should use here too. Not a big deal though
obviously...

>
> +	if (!pmdp) {
> +		pgdp = pgd_offset(mm, addr);
> +		p4dp = p4d_alloc(mm, pgdp, addr);
> +		if (!p4dp)
> +			return SCAN_FAIL;
> +		pudp = pud_alloc(mm, p4dp, addr);
> +		if (!pudp)
> +			return SCAN_FAIL;
> +		pmdp = pmd_alloc(mm, pudp, addr);
> +		if (!pmdp)
> +			return SCAN_FAIL;
> +	}
> +
> +	vmf.pmd = pmdp;
>  	if (do_set_pmd(&vmf, folio, page))
>  		return SCAN_FAIL;
>
> @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	switch (result) {
>  	case SCAN_SUCCEED:
>  		break;
> +	case SCAN_PMD_NULL:
>  	case SCAN_PMD_NONE:
>  		/*
>  		 * All pte entries have been removed and pmd cleared.
> --
> 2.50.1
>
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Kiryl Shutsemau 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 10:54:12AM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> >
> > Consider following example:
> >
> > 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> > 		 MAP_SHARED, fd, 0);
> > 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> >
> > fd is a populated tmpfs file.
> >
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> >
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> >
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> 
> There was a v1 with tags, you've not propagated any of them? Did you feel
> the change was enough to remove them?

I moved code around and was not comfortable to carry tags over.

> Anyway, LGTM so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> > ---
> >
> > v2:
> >  - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> >
> > ---
> >  mm/khugepaged.c | 20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b486c1d19b2d..986718599355 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >  static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> >  			pmd_t *pmdp, struct folio *folio, struct page *page)
> >  {
> > +	struct mm_struct *mm = vma->vm_mm;
> >  	struct vm_fault vmf = {
> >  		.vma = vma,
> >  		.address = addr,
> >  		.flags = 0,
> > -		.pmd = pmdp,
> >  	};
> > +	pgd_t *pgdp;
> > +	p4d_t *p4dp;
> > +	pud_t *pudp;
> >
> >  	mmap_assert_locked(vma->vm_mm);
> 
> NIT: you have mm as a local var should use here too. Not a big deal though
> obviously...

Do you want v3 for this?

> >
> > +	if (!pmdp) {
> > +		pgdp = pgd_offset(mm, addr);
> > +		p4dp = p4d_alloc(mm, pgdp, addr);
> > +		if (!p4dp)
> > +			return SCAN_FAIL;
> > +		pudp = pud_alloc(mm, p4dp, addr);
> > +		if (!pudp)
> > +			return SCAN_FAIL;
> > +		pmdp = pmd_alloc(mm, pudp, addr);
> > +		if (!pmdp)
> > +			return SCAN_FAIL;
> > +	}
> > +
> > +	vmf.pmd = pmdp;
> >  	if (do_set_pmd(&vmf, folio, page))
> >  		return SCAN_FAIL;
> >
> > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >  	switch (result) {
> >  	case SCAN_SUCCEED:
> >  		break;
> > +	case SCAN_PMD_NULL:
> >  	case SCAN_PMD_NONE:
> >  		/*
> >  		 * All pte entries have been removed and pmd cleared.
> > --
> > 2.50.1
> >

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Lorenzo Stoakes 2 weeks ago
Sorry mutt hid this reply from me...

On Wed, Sep 17, 2025 at 11:43:08AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 10:54:12AM +0100, Lorenzo Stoakes wrote:
> > There was a v1 with tags, you've not propagated any of them? Did you feel
> > the change was enough to remove them?
>
> I moved code around and was not comfortable to carry tags over.

Ack.

>
> > Anyway, LGTM so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > >
> > > v2:
> > >  - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> > >
> > > ---
> > >  mm/khugepaged.c | 20 +++++++++++++++++++-
> > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index b486c1d19b2d..986718599355 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> > >  static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >  			pmd_t *pmdp, struct folio *folio, struct page *page)
> > >  {
> > > +	struct mm_struct *mm = vma->vm_mm;
> > >  	struct vm_fault vmf = {
> > >  		.vma = vma,
> > >  		.address = addr,
> > >  		.flags = 0,
> > > -		.pmd = pmdp,
> > >  	};
> > > +	pgd_t *pgdp;
> > > +	p4d_t *p4dp;
> > > +	pud_t *pudp;
> > >
> > >  	mmap_assert_locked(vma->vm_mm);
> >
> > NIT: you have mm as a local var should use here too. Not a big deal though
> > obviously...
>
> Do you want v3 for this?

No, this is not a big deal.
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Zach O'Keefe 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 2:54 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> >
> > Consider following example:
> >
> >       p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> >                MAP_SHARED, fd, 0);
> >       err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> >
> > fd is a populated tmpfs file.
> >
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> >
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> >
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

So, since we are trying to aim for consistency here, I think we ought
to also support the anonymous case.

I don't have a patch, but can spot at least two things we'd need to adjust:

First, we are defeated by the check in __thp_vma_allowable_orders();

        /*
         * THPeligible bit of smaps should show 1 for proper VMAs even
         * though anon_vma is not initialized yet.
         *
         * Allow page fault since anon_vma may be not initialized until
         * the first page fault.
         */
        if (!vma->anon_vma)
                return (smaps || in_pf) ? orders : 0;

I think we can probably just delete that check, but would need to confirm.

And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
collapse_pte_mapped_thp(). I think we just need to audit places where
we return this code, to make sure it's faithfully describing a
situation where we can go ahead and install a new pmd. As a hasty
check, the return codes in check_pmd_state() don't look to follow
that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
there are many underlying failure reasons for
pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
entry".

WDYT?

> There was a v1 with tags, you've not propagated any of them? Did you feel
> the change was enough to remove them?
>
> Anyway, LGTM so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >
> > v2:
> >  - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> >
> > ---
> >  mm/khugepaged.c | 20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b486c1d19b2d..986718599355 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >  static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> >                       pmd_t *pmdp, struct folio *folio, struct page *page)
> >  {
> > +     struct mm_struct *mm = vma->vm_mm;
> >       struct vm_fault vmf = {
> >               .vma = vma,
> >               .address = addr,
> >               .flags = 0,
> > -             .pmd = pmdp,
> >       };
> > +     pgd_t *pgdp;
> > +     p4d_t *p4dp;
> > +     pud_t *pudp;
> >
> >       mmap_assert_locked(vma->vm_mm);
>
> NIT: you have mm as a local var should use here too. Not a big deal though
> obviously...
>
> >
> > +     if (!pmdp) {
> > +             pgdp = pgd_offset(mm, addr);
> > +             p4dp = p4d_alloc(mm, pgdp, addr);
> > +             if (!p4dp)
> > +                     return SCAN_FAIL;
> > +             pudp = pud_alloc(mm, p4dp, addr);
> > +             if (!pudp)
> > +                     return SCAN_FAIL;
> > +             pmdp = pmd_alloc(mm, pudp, addr);
> > +             if (!pmdp)
> > +                     return SCAN_FAIL;
> > +     }
> > +
> > +     vmf.pmd = pmdp;
> >       if (do_set_pmd(&vmf, folio, page))
> >               return SCAN_FAIL;
> >
> > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >       switch (result) {
> >       case SCAN_SUCCEED:
> >               break;
> > +     case SCAN_PMD_NULL:
> >       case SCAN_PMD_NONE:
> >               /*
> >                * All pte entries have been removed and pmd cleared.
> > --
> > 2.50.1
> >
>
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Kiryl Shutsemau 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote:
> So, since we are trying to aim for consistency here, I think we ought
> to also support the anonymous case.
> 
> I don't have a patch, but can spot at least two things we'd need to adjust:
> 
> First, we are defeated by the check in __thp_vma_allowable_orders();
> 
>         /*
>          * THPeligible bit of smaps should show 1 for proper VMAs even
>          * though anon_vma is not initialized yet.
>          *
>          * Allow page fault since anon_vma may be not initialized until
>          * the first page fault.
>          */
>         if (!vma->anon_vma)
>                 return (smaps || in_pf) ? orders : 0;
> 
> I think we can probably just delete that check, but would need to confirm.

Do you want MADV_COLLAPSE to work on VMAs that never got a page fault?

I think it should be fine as long as we agree that MADV_COLLAPSE implies
memory population. I think it should, but I want to be sure we are on
the same page.

I also brings a question on holes in the files on MADV_COLLAPSE. We
might want to populate them too. But it means the logic between
MADV_COLLAPSE and khugepaged will diverge. It requires larger
refactoring.

> And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
> collapse_pte_mapped_thp(). I think we just need to audit places where
> we return this code, to make sure it's faithfully describing a
> situation where we can go ahead and install a new pmd. As a hasty
> check, the return codes in check_pmd_state() don't look to follow
> that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
> there are many underlying failure reasons for
> pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
> entry".

Sounds like a plan :)

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Lorenzo Stoakes 2 weeks ago
On Wed, Sep 17, 2025 at 11:52:34AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote:
> > So, since we are trying to aim for consistency here, I think we ought
> > to also support the anonymous case.
> >
> > I don't have a patch, but can spot at least two things we'd need to adjust:
> >
> > First, we are defeated by the check in __thp_vma_allowable_orders();
> >
> >         /*
> >          * THPeligible bit of smaps should show 1 for proper VMAs even
> >          * though anon_vma is not initialized yet.
> >          *
> >          * Allow page fault since anon_vma may be not initialized until
> >          * the first page fault.
> >          */
> >         if (!vma->anon_vma)
> >                 return (smaps || in_pf) ? orders : 0;
> >
> > I think we can probably just delete that check, but would need to confirm.
>
> Do you want MADV_COLLAPSE to work on VMAs that never got a page fault?
>
> I think it should be fine as long as we agree that MADV_COLLAPSE implies
> memory population. I think it should, but I want to be sure we are on
> the same page.

I'm definitely in favour of MADV_COLLAPSE not requiring pre-faulting like
that.

As long as nothing is assuming that an anon_vma already exists, which
surely is not possible as we ignore in smaps/pf case.

(Oh how I hate how anon_vma works)

Though would this not also impact khugepaged?

thp_vma_allowable_order() called with TVA_KHUGEPAGED are in:

- khugepaged_enter_vma() - does't matter really, one VMA having anon_vma or
  not should be immaterial to adding the mm to khugepaged, not hugely
  likely _none_ would have anon_vma...

- khugepaged_scan_mm_slot() - well, we will require at least 1 PTE for this
  to succeed anyway right? We might waste some time though. But probably
  not much.  hpage_collapse_scan_pmd() -> collapse_huge_page() tries an
  anon VMA lock, but would have to have found PTEs to succeed...

So actually it is fine to just remove the check.

>
> I also brings a question on holes in the files on MADV_COLLAPSE. We
> might want to populate them too. But it means the logic between
> MADV_COLLAPSE and khugepaged will diverge. It requires larger
> refactoring.

I think all of THP requires a larger refactoring :) but interesting point.

>
> > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
> > collapse_pte_mapped_thp(). I think we just need to audit places where
> > we return this code, to make sure it's faithfully describing a
> > situation where we can go ahead and install a new pmd. As a hasty
> > check, the return codes in check_pmd_state() don't look to follow
> > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
> > there are many underlying failure reasons for
> > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
> > entry".
>
> Sounds like a plan :)

There being differences is kind of horrible. So yes indeed, one to look at
:)

>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Cheers, Lorenzo
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Zach O'Keefe 2 weeks, 1 day ago
On Wed, Sep 17, 2025 at 3:52 AM Kiryl Shutsemau <kirill@shutemov.name> wrote:
>
> On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote:
> > So, since we are trying to aim for consistency here, I think we ought
> > to also support the anonymous case.
> >
> > I don't have a patch, but can spot at least two things we'd need to adjust:
> >
> > First, we are defeated by the check in __thp_vma_allowable_orders();
> >
> >         /*
> >          * THPeligible bit of smaps should show 1 for proper VMAs even
> >          * though anon_vma is not initialized yet.
> >          *
> >          * Allow page fault since anon_vma may be not initialized until
> >          * the first page fault.
> >          */
> >         if (!vma->anon_vma)
> >                 return (smaps || in_pf) ? orders : 0;
> >
> > I think we can probably just delete that check, but would need to confirm.
>
> Do you want MADV_COLLAPSE to work on VMAs that never got a page fault?
>
> I think it should be fine as long as we agree that MADV_COLLAPSE implies
> memory population. I think it should, but I want to be sure we are on
> the same page.

Exactly. I'm always a little embarrassed when telling people about how
to successfully use MADV_COLLAPSE, "oh, but makes sure you fault at
least one page beforehand because of ~reasons~"

> I also brings a question on holes in the files on MADV_COLLAPSE. We
> might want to populate them too. But it means the logic between
> MADV_COLLAPSE and khugepaged will diverge. It requires larger
> refactoring.

Yeah, and taking a look more thorough am perhaps reminded why I didn't
pursue this yet.

> > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
> > collapse_pte_mapped_thp(). I think we just need to audit places where
> > we return this code, to make sure it's faithfully describing a
> > situation where we can go ahead and install a new pmd. As a hasty
> > check, the return codes in check_pmd_state() don't look to follow
> > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
> > there are many underlying failure reasons for
> > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
> > entry".
>
> Sounds like a plan :)

:) Frankly, I don't have cycles to tackle this at the moment, and
unfair to push the work on you, given it's non-trivial, so can have my

Reviewed-by: Zach O'Keefe <zokeefe@google.com>

For this patch ; though Andrew has already taken it

Hopefully I can look and sneak improvements into 6.18 -- but wouldn't
hold my breath.

> --
>   Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Zi Yan 2 weeks, 2 days ago
On 15 Sep 2025, at 9:52, Kiryl Shutsemau wrote:

> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.

Why does collapse code want to know the difference between SCAN_PMD_NULL and
SCAN_PMD_NONE? Both seems to be treated as “nothing here, install a PMD
leaf”. One difference is that madvise_collapse() will continue
on SCAN_PMD_NULL but bail out on SCAN_PMD_NONE.

I wonder if we could have SCAN_PMD_NULL_OR_NONE instead.

Zach, since you added both, can you share some insight? Thanks.

>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>
> v2:
>  - Modify set_huge_pmd() instead of introducing install_huge_pmd();
>
> ---
>  mm/khugepaged.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>

The changes look good to me. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by Dev Jain 2 weeks, 2 days ago
On 15/09/25 7:22 pm, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>
> v2:
>   - Modify set_huge_pmd() instead of introducing install_huge_pmd();
>
> ---
>   mm/khugepaged.c | 20 +++++++++++++++++++-
>   1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b486c1d19b2d..986718599355 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>   static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   			pmd_t *pmdp, struct folio *folio, struct page *page)
>   {
> +	struct mm_struct *mm = vma->vm_mm;
>   	struct vm_fault vmf = {
>   		.vma = vma,
>   		.address = addr,
>   		.flags = 0,
> -		.pmd = pmdp,
>   	};
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp;
>   
>   	mmap_assert_locked(vma->vm_mm);

I was going to reply to v1 - you could replace vma->vm_mm
with mm here.

Reviewed-by: Dev Jain <dev.jain@arm.com>
Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
Posted by David Hildenbrand 2 weeks, 3 days ago
On 15.09.25 15:52, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
> 
> Consider following example:
> 
> 	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> 		 MAP_SHARED, fd, 0);
> 	err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> 
> fd is a populated tmpfs file.
> 
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
> 
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> 
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb