mm/khugepaged.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-)
From: Kiryl Shutsemau <kas@kernel.org>
MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
PMD page table is installed or not.
Consider following example:
p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
err = madvise(p, 2UL << 20, MADV_COLLAPSE);
fd is a populated tmpfs file.
The result depends on the address that the kernel returns on mmap().
If it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.
This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
a page table is missing, which causes collapse_pte_mapped_thp() to fail.
SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
v2:
- Modify set_huge_pmd() instead of introducing install_huge_pmd();
---
mm/khugepaged.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b486c1d19b2d..986718599355 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct folio *folio, struct page *page)
{
+ struct mm_struct *mm = vma->vm_mm;
struct vm_fault vmf = {
.vma = vma,
.address = addr,
.flags = 0,
- .pmd = pmdp,
};
+ pgd_t *pgdp;
+ p4d_t *p4dp;
+ pud_t *pudp;
mmap_assert_locked(vma->vm_mm);
+ if (!pmdp) {
+ pgdp = pgd_offset(mm, addr);
+ p4dp = p4d_alloc(mm, pgdp, addr);
+ if (!p4dp)
+ return SCAN_FAIL;
+ pudp = pud_alloc(mm, p4dp, addr);
+ if (!pudp)
+ return SCAN_FAIL;
+ pmdp = pmd_alloc(mm, pudp, addr);
+ if (!pmdp)
+ return SCAN_FAIL;
+ }
+
+ vmf.pmd = pmdp;
if (do_set_pmd(&vmf, folio, page))
return SCAN_FAIL;
@@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
switch (result) {
case SCAN_SUCCEED:
break;
+ case SCAN_PMD_NULL:
case SCAN_PMD_NONE:
/*
* All pte entries have been removed and pmd cleared.
--
2.50.1
On 2025/9/15 21:52, Kiryl Shutsemau wrote: > From: Kiryl Shutsemau <kas@kernel.org> > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > PMD page table is installed or not. > > Consider following example: > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > fd is a populated tmpfs file. > > The result depends on the address that the kernel returns on mmap(). > If it is located in an existing PMD table, the madvise() will succeed. > However, if the table does not exist, it will fail with -EINVAL. > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > tables as needed. > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> > --- LGTM. Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote: > From: Kiryl Shutsemau <kas@kernel.org> > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > PMD page table is installed or not. > > Consider following example: > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > fd is a populated tmpfs file. > > The result depends on the address that the kernel returns on mmap(). > If it is located in an existing PMD table, the madvise() will succeed. > However, if the table does not exist, it will fail with -EINVAL. > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > tables as needed. > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> There was a v1 with tags, you've not propagated any of them? Did you feel the change was enough to remove them? Anyway, LGTM so: Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > > v2: > - Modify set_huge_pmd() instead of introducing install_huge_pmd(); > > --- > mm/khugepaged.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index b486c1d19b2d..986718599355 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, > pmd_t *pmdp, struct folio *folio, struct page *page) > { > + struct mm_struct *mm = vma->vm_mm; > struct vm_fault vmf = { > .vma = vma, > .address = addr, > .flags = 0, > - .pmd = pmdp, > }; > + pgd_t *pgdp; > + p4d_t *p4dp; > + pud_t *pudp; > > mmap_assert_locked(vma->vm_mm); NIT: you have mm as a local var should use here too. Not a big deal though obviously... > > + if (!pmdp) { > + pgdp = pgd_offset(mm, addr); > + p4dp = p4d_alloc(mm, pgdp, addr); > + if (!p4dp) > + return SCAN_FAIL; > + pudp = pud_alloc(mm, p4dp, addr); > + if (!pudp) > + return SCAN_FAIL; > + pmdp = pmd_alloc(mm, pudp, addr); > + if (!pmdp) > + return SCAN_FAIL; > + } > + > + vmf.pmd = pmdp; > if (do_set_pmd(&vmf, folio, page)) > return SCAN_FAIL; > > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > switch (result) { > case SCAN_SUCCEED: > break; > + case SCAN_PMD_NULL: > case SCAN_PMD_NONE: > /* > * All pte entries have been removed and pmd cleared. > -- > 2.50.1 >
On Tue, Sep 16, 2025 at 10:54:12AM +0100, Lorenzo Stoakes wrote: > On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote: > > From: Kiryl Shutsemau <kas@kernel.org> > > > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > > PMD page table is installed or not. > > > > Consider following example: > > > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > > MAP_SHARED, fd, 0); > > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > > > fd is a populated tmpfs file. > > > > The result depends on the address that the kernel returns on mmap(). > > If it is located in an existing PMD table, the madvise() will succeed. > > However, if the table does not exist, it will fail with -EINVAL. > > > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > > tables as needed. > > > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> > > There was a v1 with tags, you've not propagated any of them? Did you feel > the change was enough to remove them? I moved code around and was not comfortable to carry tags over. > Anyway, LGTM so: > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > --- > > > > v2: > > - Modify set_huge_pmd() instead of introducing install_huge_pmd(); > > > > --- > > mm/khugepaged.c | 20 +++++++++++++++++++- > > 1 file changed, 19 insertions(+), 1 deletion(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index b486c1d19b2d..986718599355 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, > > pmd_t *pmdp, struct folio *folio, struct page *page) > > { > > + struct mm_struct *mm = vma->vm_mm; > > struct vm_fault vmf = { > > .vma = vma, > > .address = addr, > > .flags = 0, > > - .pmd = pmdp, > > }; > > + pgd_t *pgdp; > > + p4d_t *p4dp; > > + pud_t *pudp; > > > > mmap_assert_locked(vma->vm_mm); > > NIT: you have mm as a local var should use here too. Not a big deal though > obviously... Do you want v3 for this? > > > > + if (!pmdp) { > > + pgdp = pgd_offset(mm, addr); > > + p4dp = p4d_alloc(mm, pgdp, addr); > > + if (!p4dp) > > + return SCAN_FAIL; > > + pudp = pud_alloc(mm, p4dp, addr); > > + if (!pudp) > > + return SCAN_FAIL; > > + pmdp = pmd_alloc(mm, pudp, addr); > > + if (!pmdp) > > + return SCAN_FAIL; > > + } > > + > > + vmf.pmd = pmdp; > > if (do_set_pmd(&vmf, folio, page)) > > return SCAN_FAIL; > > > > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > switch (result) { > > case SCAN_SUCCEED: > > break; > > + case SCAN_PMD_NULL: > > case SCAN_PMD_NONE: > > /* > > * All pte entries have been removed and pmd cleared. > > -- > > 2.50.1 > > -- Kiryl Shutsemau / Kirill A. Shutemov
Sorry mutt hid this reply from me... On Wed, Sep 17, 2025 at 11:43:08AM +0100, Kiryl Shutsemau wrote: > On Tue, Sep 16, 2025 at 10:54:12AM +0100, Lorenzo Stoakes wrote: > > There was a v1 with tags, you've not propagated any of them? Did you feel > > the change was enough to remove them? > > I moved code around and was not comfortable to carry tags over. Ack. > > > Anyway, LGTM so: > > > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > > --- > > > > > > v2: > > > - Modify set_huge_pmd() instead of introducing install_huge_pmd(); > > > > > > --- > > > mm/khugepaged.c | 20 +++++++++++++++++++- > > > 1 file changed, 19 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > index b486c1d19b2d..986718599355 100644 > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > > > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, > > > pmd_t *pmdp, struct folio *folio, struct page *page) > > > { > > > + struct mm_struct *mm = vma->vm_mm; > > > struct vm_fault vmf = { > > > .vma = vma, > > > .address = addr, > > > .flags = 0, > > > - .pmd = pmdp, > > > }; > > > + pgd_t *pgdp; > > > + p4d_t *p4dp; > > > + pud_t *pudp; > > > > > > mmap_assert_locked(vma->vm_mm); > > > > NIT: you have mm as a local var should use here too. Not a big deal though > > obviously... > > Do you want v3 for this? No, this is not a big deal.
On Tue, Sep 16, 2025 at 2:54 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote: > > From: Kiryl Shutsemau <kas@kernel.org> > > > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > > PMD page table is installed or not. > > > > Consider following example: > > > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > > MAP_SHARED, fd, 0); > > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > > > fd is a populated tmpfs file. > > > > The result depends on the address that the kernel returns on mmap(). > > If it is located in an existing PMD table, the madvise() will succeed. > > However, if the table does not exist, it will fail with -EINVAL. > > > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > > tables as needed. > > > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> So, since we are trying to aim for consistency here, I think we ought to also support the anonymous case. I don't have a patch, but can spot at least two things we'd need to adjust: First, we are defeated by the check in __thp_vma_allowable_orders(); /* * THPeligible bit of smaps should show 1 for proper VMAs even * though anon_vma is not initialized yet. * * Allow page fault since anon_vma may be not initialized until * the first page fault. */ if (!vma->anon_vma) return (smaps || in_pf) ? orders : 0; I think we can probably just delete that check, but would need to confirm. And second, madvise_collapse() doesn't route SCAN_PMD_NULL to collapse_pte_mapped_thp(). I think we just need to audit places where we return this code, to make sure it's faithfully describing a situation where we can go ahead and install a new pmd. As a hasty check, the return codes in check_pmd_state() don't look to follow that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise, there are many underlying failure reasons for pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD entry". WDYT? > There was a v1 with tags, you've not propagated any of them? Did you feel > the change was enough to remove them? > > Anyway, LGTM so: > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > --- > > > > v2: > > - Modify set_huge_pmd() instead of introducing install_huge_pmd(); > > > > --- > > mm/khugepaged.c | 20 +++++++++++++++++++- > > 1 file changed, 19 insertions(+), 1 deletion(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index b486c1d19b2d..986718599355 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, > > pmd_t *pmdp, struct folio *folio, struct page *page) > > { > > + struct mm_struct *mm = vma->vm_mm; > > struct vm_fault vmf = { > > .vma = vma, > > .address = addr, > > .flags = 0, > > - .pmd = pmdp, > > }; > > + pgd_t *pgdp; > > + p4d_t *p4dp; > > + pud_t *pudp; > > > > mmap_assert_locked(vma->vm_mm); > > NIT: you have mm as a local var should use here too. Not a big deal though > obviously... > > > > > + if (!pmdp) { > > + pgdp = pgd_offset(mm, addr); > > + p4dp = p4d_alloc(mm, pgdp, addr); > > + if (!p4dp) > > + return SCAN_FAIL; > > + pudp = pud_alloc(mm, p4dp, addr); > > + if (!pudp) > > + return SCAN_FAIL; > > + pmdp = pmd_alloc(mm, pudp, addr); > > + if (!pmdp) > > + return SCAN_FAIL; > > + } > > + > > + vmf.pmd = pmdp; > > if (do_set_pmd(&vmf, folio, page)) > > return SCAN_FAIL; > > > > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > switch (result) { > > case SCAN_SUCCEED: > > break; > > + case SCAN_PMD_NULL: > > case SCAN_PMD_NONE: > > /* > > * All pte entries have been removed and pmd cleared. > > -- > > 2.50.1 > > >
On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote: > So, since we are trying to aim for consistency here, I think we ought > to also support the anonymous case. > > I don't have a patch, but can spot at least two things we'd need to adjust: > > First, we are defeated by the check in __thp_vma_allowable_orders(); > > /* > * THPeligible bit of smaps should show 1 for proper VMAs even > * though anon_vma is not initialized yet. > * > * Allow page fault since anon_vma may be not initialized until > * the first page fault. > */ > if (!vma->anon_vma) > return (smaps || in_pf) ? orders : 0; > > I think we can probably just delete that check, but would need to confirm. Do you want MADV_COLLAPSE to work on VMAs that never got a page fault? I think it should be fine as long as we agree that MADV_COLLAPSE implies memory population. I think it should, but I want to be sure we are on the same page. I also brings a question on holes in the files on MADV_COLLAPSE. We might want to populate them too. But it means the logic between MADV_COLLAPSE and khugepaged will diverge. It requires larger refactoring. > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to > collapse_pte_mapped_thp(). I think we just need to audit places where > we return this code, to make sure it's faithfully describing a > situation where we can go ahead and install a new pmd. As a hasty > check, the return codes in check_pmd_state() don't look to follow > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise, > there are many underlying failure reasons for > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD > entry". Sounds like a plan :) -- Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Sep 17, 2025 at 11:52:34AM +0100, Kiryl Shutsemau wrote: > On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote: > > So, since we are trying to aim for consistency here, I think we ought > > to also support the anonymous case. > > > > I don't have a patch, but can spot at least two things we'd need to adjust: > > > > First, we are defeated by the check in __thp_vma_allowable_orders(); > > > > /* > > * THPeligible bit of smaps should show 1 for proper VMAs even > > * though anon_vma is not initialized yet. > > * > > * Allow page fault since anon_vma may be not initialized until > > * the first page fault. > > */ > > if (!vma->anon_vma) > > return (smaps || in_pf) ? orders : 0; > > > > I think we can probably just delete that check, but would need to confirm. > > Do you want MADV_COLLAPSE to work on VMAs that never got a page fault? > > I think it should be fine as long as we agree that MADV_COLLAPSE implies > memory population. I think it should, but I want to be sure we are on > the same page. I'm definitely in favour of MADV_COLLAPSE not requiring pre-faulting like that. As long as nothing is assuming that an anon_vma already exists, which surely is not possible as we ignore in smaps/pf case. (Oh how I hate how anon_vma works) Though would this not also impact khugepaged? thp_vma_allowable_order() called with TVA_KHUGEPAGED are in: - khugepaged_enter_vma() - does't matter really, one VMA having anon_vma or not should be immaterial to adding the mm to khugepaged, not hugely likely _none_ would have anon_vma... - khugepaged_scan_mm_slot() - well, we will require at least 1 PTE for this to succeed anyway right? We might waste some time though. But probably not much. hpage_collapse_scan_pmd() -> collapse_huge_page() tries an anon VMA lock, but would have to have found PTEs to succeed... So actually it is fine to just remove the check. > > I also brings a question on holes in the files on MADV_COLLAPSE. We > might want to populate them too. But it means the logic between > MADV_COLLAPSE and khugepaged will diverge. It requires larger > refactoring. I think all of THP requires a larger refactoring :) but interesting point. > > > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to > > collapse_pte_mapped_thp(). I think we just need to audit places where > > we return this code, to make sure it's faithfully describing a > > situation where we can go ahead and install a new pmd. As a hasty > > check, the return codes in check_pmd_state() don't look to follow > > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise, > > there are many underlying failure reasons for > > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD > > entry". > > Sounds like a plan :) There being differences is kind of horrible. So yes indeed, one to look at :) > > -- > Kiryl Shutsemau / Kirill A. Shutemov Cheers, Lorenzo
On Wed, Sep 17, 2025 at 3:52 AM Kiryl Shutsemau <kirill@shutemov.name> wrote: > > On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote: > > So, since we are trying to aim for consistency here, I think we ought > > to also support the anonymous case. > > > > I don't have a patch, but can spot at least two things we'd need to adjust: > > > > First, we are defeated by the check in __thp_vma_allowable_orders(); > > > > /* > > * THPeligible bit of smaps should show 1 for proper VMAs even > > * though anon_vma is not initialized yet. > > * > > * Allow page fault since anon_vma may be not initialized until > > * the first page fault. > > */ > > if (!vma->anon_vma) > > return (smaps || in_pf) ? orders : 0; > > > > I think we can probably just delete that check, but would need to confirm. > > Do you want MADV_COLLAPSE to work on VMAs that never got a page fault? > > I think it should be fine as long as we agree that MADV_COLLAPSE implies > memory population. I think it should, but I want to be sure we are on > the same page. Exactly. I'm always a little embarrassed when telling people about how to successfully use MADV_COLLAPSE, "oh, but makes sure you fault at least one page beforehand because of ~reasons~" > I also brings a question on holes in the files on MADV_COLLAPSE. We > might want to populate them too. But it means the logic between > MADV_COLLAPSE and khugepaged will diverge. It requires larger > refactoring. Yeah, and taking a look more thorough am perhaps reminded why I didn't pursue this yet. > > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to > > collapse_pte_mapped_thp(). I think we just need to audit places where > > we return this code, to make sure it's faithfully describing a > > situation where we can go ahead and install a new pmd. As a hasty > > check, the return codes in check_pmd_state() don't look to follow > > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise, > > there are many underlying failure reasons for > > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD > > entry". > > Sounds like a plan :) :) Frankly, I don't have cycles to tackle this at the moment, and unfair to push the work on you, given it's non-trivial, so can have my Reviewed-by: Zach O'Keefe <zokeefe@google.com> For this patch ; though Andrew has already taken it Hopefully I can look and sneak improvements into 6.18 -- but wouldn't hold my breath. > -- > Kiryl Shutsemau / Kirill A. Shutemov
On 15 Sep 2025, at 9:52, Kiryl Shutsemau wrote: > From: Kiryl Shutsemau <kas@kernel.org> > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > PMD page table is installed or not. > > Consider following example: > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > fd is a populated tmpfs file. > > The result depends on the address that the kernel returns on mmap(). > If it is located in an existing PMD table, the madvise() will succeed. > However, if the table does not exist, it will fail with -EINVAL. > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > tables as needed. Why does collapse code want to know the difference between SCAN_PMD_NULL and SCAN_PMD_NONE? Both seems to be treated as “nothing here, install a PMD leaf”. One difference is that madvise_collapse() will continue on SCAN_PMD_NULL but bail out on SCAN_PMD_NONE. I wonder if we could have SCAN_PMD_NULL_OR_NONE instead. Zach, since you added both, can you share some insight? Thanks. > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> > --- > > v2: > - Modify set_huge_pmd() instead of introducing install_huge_pmd(); > > --- > mm/khugepaged.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > The changes look good to me. Reviewed-by: Zi Yan <ziy@nvidia.com> Best Regards, Yan, Zi
On 15/09/25 7:22 pm, Kiryl Shutsemau wrote: > From: Kiryl Shutsemau <kas@kernel.org> > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > PMD page table is installed or not. > > Consider following example: > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > fd is a populated tmpfs file. > > The result depends on the address that the kernel returns on mmap(). > If it is located in an existing PMD table, the madvise() will succeed. > However, if the table does not exist, it will fail with -EINVAL. > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > tables as needed. > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> > --- > > v2: > - Modify set_huge_pmd() instead of introducing install_huge_pmd(); > > --- > mm/khugepaged.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index b486c1d19b2d..986718599355 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, > pmd_t *pmdp, struct folio *folio, struct page *page) > { > + struct mm_struct *mm = vma->vm_mm; > struct vm_fault vmf = { > .vma = vma, > .address = addr, > .flags = 0, > - .pmd = pmdp, > }; > + pgd_t *pgdp; > + p4d_t *p4dp; > + pud_t *pudp; > > mmap_assert_locked(vma->vm_mm); I was going to reply to v1 - you could replace vma->vm_mm with mm here. Reviewed-by: Dev Jain <dev.jain@arm.com>
On 15.09.25 15:52, Kiryl Shutsemau wrote: > From: Kiryl Shutsemau <kas@kernel.org> > > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if > PMD page table is installed or not. > > Consider following example: > > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > err = madvise(p, 2UL << 20, MADV_COLLAPSE); > > fd is a populated tmpfs file. > > The result depends on the address that the kernel returns on mmap(). > If it is located in an existing PMD table, the madvise() will succeed. > However, if the table does not exist, it will fail with -EINVAL. > > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when > a page table is missing, which causes collapse_pte_mapped_thp() to fail. > > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page > tables as needed. > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Cheers David / dhildenb
© 2016 - 2025 Red Hat, Inc.