[PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios

Vernon Yang posted 5 patches 2 weeks, 1 day ago
There is a newer version of this series
Posted by Vernon Yang 2 weeks, 1 day ago
From: Vernon Yang <yanglincheng@kylinos.cn>

For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.

Also, if we collapse a range that contains lazyfree pages, their content
will never be none and the deferred shrinker cannot reclaim them.

So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it,
thereby avoiding unnecessary scan and collapse operations and reducing
wasted CPU time.
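
As a rough illustration of the cold task described above, here is a
minimal userspace sketch. The helper name cold_task_map() and the fill
pattern are assumptions made for this example; they are not taken from
the actual test program, which is not shown in this posting.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* Linux value, in case the libc headers are old */
#endif

/*
 * Hypothetical helper: map an anonymous region, touch it once, then
 * tell the kernel the contents may be discarded lazily.
 *
 * Returns the mapping (still mapped, so khugepaged can still see it),
 * or NULL if mmap() failed. The madvise() return value is reported
 * through *madvise_ret so the caller can decide how to treat a failure.
 */
static void *cold_task_map(size_t len, int *madvise_ret)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* Brief access: fault every page in by writing to it. */
	memset(buf, 0x5a, len);

	/* From here on the pages are lazy-free candidates. */
	*madvise_ret = madvise(buf, len, MADV_FREE);
	return buf;
}
```

In the scenario above the cold task would call this with a 128 MB
length and then stay idle, while hot1 and hot2 keep touching their own
regions.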

Here are the performance test results:
(For throughput, higher is better; for the other metrics, lower is better.)

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
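
For reference, the delta column in both tables is consistent with a
plain relative change against the unpatched value. A small sketch of
that arithmetic (the function name delta_percent() is made up for this
example):

```c
/*
 * Relative change of the patched value against the unpatched one,
 * in percent. Negative means the patched value is smaller.
 */
static double delta_percent(double without_patch, double with_patch)
{
	return (with_patch - without_patch) / without_patch * 100.0;
}
```

For instance, delta_percent(3.14, 2.93) is about -6.69 and
delta_percent(104.38, 111.89) is about +7.19, matching the first table.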

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 384e29f6bef0..bcdc57eea270 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
 	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
 	EM( SCAN_PAGE_LOCK,		"page_locked")			\
 	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
+	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
 	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
 	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
 	EM( SCAN_VMA_NULL,		"vma_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index de95029e3763..be1c09842ea2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
+			result = SCAN_PAGE_LAZYFREE;
+			goto out;
+		}
+
 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
@@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);
 
+		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
+			result = SCAN_PAGE_LAZYFREE;
+			goto out_unmap;
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
-- 
2.51.0
Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
Posted by Lance Yang 2 weeks ago

On 2026/1/23 16:22, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
> 
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
> 
> And if we collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.
> 
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
> 
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
> 
> Testing on x86_64 machine:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> | cycles per access   |  4.96         |  2.21         | -55.44% |
> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> 
> Testing on qemu-system-x86_64 -enable-kvm:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> | cycles per access   |  7.29         |  2.07         | -71.60% |
> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   include/trace/events/huge_memory.h |  1 +
>   mm/khugepaged.c                    | 11 +++++++++++
>   2 files changed, 12 insertions(+)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 384e29f6bef0..bcdc57eea270 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
>   	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
>   	EM( SCAN_PAGE_LOCK,		"page_locked")			\
>   	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
> +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
>   	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
>   	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
>   	EM( SCAN_VMA_NULL,		"vma_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index de95029e3763..be1c09842ea2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -46,6 +46,7 @@ enum scan_result {
>   	SCAN_PAGE_LRU,
>   	SCAN_PAGE_LOCK,
>   	SCAN_PAGE_ANON,
> +	SCAN_PAGE_LAZYFREE,
>   	SCAN_PAGE_COMPOUND,
>   	SCAN_ANY_PROCESS,
>   	SCAN_VMA_NULL,
> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		folio = page_folio(page);
>   		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>   
> +		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {

I'm wondering if we need "cc->is_khugepaged &&" as well here?

We should allow users to enforce collapse via the madvise_collapse()
path even if pages are marked lazyfree, IMHO.
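
To make the suggestion concrete, here is a small standalone model of
the gated condition. It is only a sketch: struct collapse_control_model
and skip_lazyfree() are stand-ins invented for this example, and the
two bool arguments represent the results of pte_dirty(pteval) and
folio_test_lazyfree(folio) from the hunk above.

```c
#include <stdbool.h>

/* Stand-in for struct collapse_control; only the field discussed here. */
struct collapse_control_model {
	bool is_khugepaged;	/* true for khugepaged, false for MADV_COLLAPSE */
};

/*
 * Model of the proposed check: skip a clean lazy-free folio only when
 * the scan comes from khugepaged, never on the madvise_collapse() path.
 */
static bool skip_lazyfree(const struct collapse_control_model *cc,
			  bool pte_is_dirty, bool folio_is_lazyfree)
{
	return cc->is_khugepaged && !pte_is_dirty && folio_is_lazyfree;
}
```

With this gating, a clean lazy-free folio is skipped by khugepaged but
still considered when the user explicitly asks for a collapse.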

> +			result = SCAN_PAGE_LAZYFREE;
> +			goto out;
> +		}
> +
>   		/* See hpage_collapse_scan_pmd(). */
>   		if (folio_maybe_mapped_shared(folio)) {
>   			++shared;
> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>   		}
>   		folio = page_folio(page);
>   
> +		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {

Ditto.

> +			result = SCAN_PAGE_LAZYFREE;
> +			goto out_unmap;
> +		}
> +
>   		if (!folio_test_anon(folio)) {
>   			result = SCAN_PAGE_ANON;
>   			goto out_unmap;
Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
Posted by Vernon Yang 2 weeks ago
On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/23 16:22, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > And if we collapse with a lazyfree page, that content will never be none
> > and the deferred shrinker cannot reclaim them.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   include/trace/events/huge_memory.h |  1 +
> >   mm/khugepaged.c                    | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 384e29f6bef0..bcdc57eea270 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> >       EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> >       EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> >       EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > +     EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> >       EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> >       EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> >       EM( SCAN_VMA_NULL,              "vma_null")                     \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index de95029e3763..be1c09842ea2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -46,6 +46,7 @@ enum scan_result {
> >       SCAN_PAGE_LRU,
> >       SCAN_PAGE_LOCK,
> >       SCAN_PAGE_ANON,
> > +     SCAN_PAGE_LAZYFREE,
> >       SCAN_PAGE_COMPOUND,
> >       SCAN_ANY_PROCESS,
> >       SCAN_VMA_NULL,
> > @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               folio = page_folio(page);
> >               VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>
> I'm wondering if we need "cc->is_khugepaged &&" as well here?
>
> We should allow users to enforce collapse via the madvise_collapse()
> path even if pages are marked lazyfree, IMHO.

$ man madvise
MADV_COLLAPSE
        Perform a best-effort synchronous collapse of the native pages
        mapped by the memory range into Transparent Huge Pages (THPs).

The semantics of MADV_COLLAPSE are best-effort and do not imply that
collapsing is enforced, so we don't need "cc->is_khugepaged" here.

If a user applies both MADV_FREE and MADV_COLLAPSE to the same range,
it suggests a misunderstanding of their semantics. As the kernel, we
need to safeguard the baseline.

> > +                     result = SCAN_PAGE_LAZYFREE;
> > +                     goto out;
> > +             }
> > +
> >               /* See hpage_collapse_scan_pmd(). */
> >               if (folio_maybe_mapped_shared(folio)) {
> >                       ++shared;
> > @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >               }
> >               folio = page_folio(page);
> >
> > +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>
> Ditto.
>
> > +                     result = SCAN_PAGE_LAZYFREE;
> > +                     goto out_unmap;
> > +             }
> > +
> >               if (!folio_test_anon(folio)) {
> >                       result = SCAN_PAGE_ANON;
> >                       goto out_unmap;
>
Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
Posted by Lance Yang 2 weeks ago

On 2026/1/23 23:08, Vernon Yang wrote:
> On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> On 2026/1/23 16:22, Vernon Yang wrote:
>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>

[...]

>>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                folio = page_folio(page);
>>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>>
>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>
>> I'm wondering if we need "cc->is_khugepaged &&" as well here?
>>
>> We should allow users to enforce collapse via the madvise_collapse()
>> path even if pages are marked lazyfree, IMHO.
> 
> $ man madvise
> MADV_COLLAPSE
>          Perform a best-effort synchronous collapse of the native pages
>          mapped by the memory range into Transparent Huge Pages (THPs).
> 
> The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
> collapsing, so we don't need "cc->is_khugepaged" here.
> 
> We can imagine that if a user simultaneously uses MADV_FREE and
> MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
> As the kernel, we need to safeguard the baseline.

No. Afraid I don't think so.

To be clear, what I meant by "enforce":

Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
call MADV_COLLAPSE, they're explicitly asking for collapse.

Compared to khugepaged just scanning around, that's already "enforce"
- users are actively requesting it, not passively waiting for it.

Note that you're *breaking* userspace. Users would no longer be able
to collapse a range that contains any lazyfree pages, even when they
explicitly call MADV_COLLAPSE.

For khugepaged, skipping lazyfree makes sense.
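
A minimal userspace sketch of the sequence in question. Assumptions
for this example: a 2 MB PMD size (as on x86_64), the helper name
free_then_collapse(), and fallback values for the madvise constants in
case the libc headers lack them. Whether the final call succeeds
depends on the kernel version and the THP configuration, so the sketch
only reports the outcome.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8		/* Linux value */
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* Linux value, available since 6.1 */
#endif

#define ASSUMED_PMD_SIZE (2UL << 20)	/* assumption: 2 MB huge pages */

/*
 * Touch one PMD-sized range, mark it lazy-free, then explicitly ask
 * for a collapse. Returns 0 if both madvise() calls succeeded, the
 * errno of the first failing call otherwise, or -1 if mmap() failed.
 */
static int free_then_collapse(void)
{
	size_t len = 2 * ASSUMED_PMD_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;
	int err = 0;

	if (raw == MAP_FAILED)
		return -1;

	/* Pick the PMD-aligned range inside the over-sized mapping. */
	buf = (char *)(((uintptr_t)raw + ASSUMED_PMD_SIZE - 1) &
		       ~(uintptr_t)(ASSUMED_PMD_SIZE - 1));
	memset(buf, 1, ASSUMED_PMD_SIZE);

	if (madvise(buf, ASSUMED_PMD_SIZE, MADV_FREE) != 0)
		err = errno;
	else if (madvise(buf, ASSUMED_PMD_SIZE, MADV_COLLAPSE) != 0)
		err = errno;	/* best-effort: failure is not fatal */

	munmap(raw, len);
	return err;
}
```

A nonzero return here is not an application error, but it is exactly
the case where an unconditional lazyfree check would change what the
user sees.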

> 
>>> +                     result = SCAN_PAGE_LAZYFREE;
>>> +                     goto out;
>>> +             }
>>> +
>>>                /* See hpage_collapse_scan_pmd(). */
>>>                if (folio_maybe_mapped_shared(folio)) {
>>>                        ++shared;
>>> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>                }
>>>                folio = page_folio(page);
>>>
>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>
>> Ditto.
>>
>>> +                     result = SCAN_PAGE_LAZYFREE;
>>> +                     goto out_unmap;
>>> +             }
>>> +
>>>                if (!folio_test_anon(folio)) {
>>>                        result = SCAN_PAGE_ANON;
>>>                        goto out_unmap;
>>

Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
Posted by Vernon Yang 2 weeks ago
On Sat, Jan 24, 2026 at 12:32 AM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/23 23:08, Vernon Yang wrote:
> > On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
> >>
> >> On 2026/1/23 16:22, Vernon Yang wrote:
> >>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>
>
> [...]
>
> >>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>                folio = page_folio(page);
> >>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >>>
> >>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
> >>
> >> I'm wondering if we need "cc->is_khugepaged &&" as well here?
> >>
> >> We should allow users to enforce collapse via the madvise_collapse()
> >> path even if pages are marked lazyfree, IMHO.
> >
> > $ man madvise
> > MADV_COLLAPSE
> >          Perform a best-effort synchronous collapse of the native pages
> >          mapped by the memory range into Transparent Huge Pages (THPs).
> >
> > The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
> > collapsing, so we don't need "cc->is_khugepaged" here.
> >
> > We can imagine that if a user simultaneously uses MADV_FREE and
> > MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
> > As the kernel, we need to safeguard the baseline.
>
> No. Afraid I don't think so.
>
> To be clear, what I meant by "enforce":
>
> Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
> call MADV_COLLAPSE, they're explicitly asking for collapse.
>
> Compared to khugepaged just scanning around, that's already "enforce"
> - users are actively requesting it, not passively waiting for.
>
> Note that you're *breaking* userspace. Users would not be able
> to collapse the range where there are any lazyfree pages anymore,
> even when they explicitly call MADV_COLLAPSE.
>
> For khugepaged, skipping lazyfree makes sense.

I see what you mean. This comes down to two questions:

1. Do best-effort semantics imply any notion of "enforce"?
2. When a range has been marked with madvise(MADV_FREE) and the user
   then calls madvise(MADV_COLLAPSE) on it, do we want to collapse
   lazyfree folios?

This is a question of semantics, and I'd like to hear others' opinions.

> >
> >>> +                     result = SCAN_PAGE_LAZYFREE;
> >>> +                     goto out;
> >>> +             }
> >>> +
> >>>                /* See hpage_collapse_scan_pmd(). */
> >>>                if (folio_maybe_mapped_shared(folio)) {
> >>>                        ++shared;
> >>> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>                }
> >>>                folio = page_folio(page);
> >>>
> >>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
> >>
> >> Ditto.
> >>
> >>> +                     result = SCAN_PAGE_LAZYFREE;
> >>> +                     goto out_unmap;
> >>> +             }
> >>> +
> >>>                if (!folio_test_anon(folio)) {
> >>>                        result = SCAN_PAGE_ANON;
> >>>                        goto out_unmap;
> >>
>
Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
Posted by Dev Jain 2 weeks ago
On 24/01/26 8:52 am, Vernon Yang wrote:
> On Sat, Jan 24, 2026 at 12:32 AM Lance Yang <lance.yang@linux.dev> wrote:
>> On 2026/1/23 23:08, Vernon Yang wrote:
>>> On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
>>>> On 2026/1/23 16:22, Vernon Yang wrote:
>>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>>>
>> [...]
>>
>>>>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>>>                folio = page_folio(page);
>>>>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>>>>
>>>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>>> I'm wondering if we need "cc->is_khugepaged &&" as well here?
>>>>
>>>> We should allow users to enforce collapse via the madvise_collapse()
>>>> path even if pages are marked lazyfree, IMHO.
>>> $ man madvise
>>> MADV_COLLAPSE
>>>          Perform a best-effort synchronous collapse of the native pages
>>>          mapped by the memory range into Transparent Huge Pages (THPs).
>>>
>>> The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
>>> collapsing, so we don't need "cc->is_khugepaged" here.
>>>
>>> We can imagine that if a user simultaneously uses MADV_FREE and
>>> MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
>>> As the kernel, we need to safeguard the baseline.
>> No. Afraid I don't think so.
>>
>> To be clear, what I meant by "enforce":
>>
>> Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
>> call MADV_COLLAPSE, they're explicitly asking for collapse.
>>
>> Compared to khugepaged just scanning around, that's already "enforce"
>> - users are actively requesting it, not passively waiting for.
>>
>> Note that you're *breaking* userspace. Users would not be able
>> to collapse the range where there are any lazyfree pages anymore,
>> even when they explicitly call MADV_COLLAPSE.
>>
>> For khugepaged, skipping lazyfree makes sense.
> I got your meaning, this is equivalent to two questions:
>
> 1. Does the semantics of best-effort imply any "enforce" meaning?
> 2. When madvise(MADV_FREE| MADV_COLLAPSE), do we want to collapse
>    lazyfree folios?
>
> This is a semantic warning, and I'd like to hear others' opinions.

Lance is right. When the user does MADV_COLLAPSE, the kernel needs to try
its best to collapse. It may not be in the best interest of the user to
do MADV_FREE and then MADV_COLLAPSE, but that is something the user has
to fix - the kernel does not need to think about it.

Regarding "best-effort": it is best-effort in the sense that
madvise(MADV_COLLAPSE) is a syscall needed not for correctness but
for optimization purposes. So it is not the end of the world
if the syscall fails. But since the user has decided to do an
expensive operation (a syscall), the kernel needs to try harder to
make sure those CPU cycles weren't wasted.

>
>>>>> +                     result = SCAN_PAGE_LAZYFREE;
>>>>> +                     goto out;
>>>>> +             }
>>>>> +
>>>>>                /* See hpage_collapse_scan_pmd(). */
>>>>>                if (folio_maybe_mapped_shared(folio)) {
>>>>>                        ++shared;
>>>>> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>                }
>>>>>                folio = page_folio(page);
>>>>>
>>>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>>> Ditto.
>>>>
>>>>> +                     result = SCAN_PAGE_LAZYFREE;
>>>>> +                     goto out_unmap;
>>>>> +             }
>>>>> +
>>>>>                if (!folio_test_anon(folio)) {
>>>>>                        result = SCAN_PAGE_ANON;
>>>>>                        goto out_unmap;
Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
Posted by Barry Song 1 week, 5 days ago
On Sat, Jan 24, 2026 at 2:48 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 24/01/26 8:52 am, Vernon Yang wrote:
> > On Sat, Jan 24, 2026 at 12:32 AM Lance Yang <lance.yang@linux.dev> wrote:
> >> On 2026/1/23 23:08, Vernon Yang wrote:
> >>> On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
> >>>> On 2026/1/23 16:22, Vernon Yang wrote:
> >>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>>>
> >> [...]
> >>
> >>>>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>>>                folio = page_folio(page);
> >>>>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >>>>>
> >>>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
> >>>> I'm wondering if we need "cc->is_khugepaged &&" as well here?
> >>>>
> >>>> We should allow users to enforce collapse via the madvise_collapse()
> >>>> path even if pages are marked lazyfree, IMHO.
> >>> $ man madvise
> >>> MADV_COLLAPSE
> >>>          Perform a best-effort synchronous collapse of the native pages
> >>>          mapped by the memory range into Transparent Huge Pages (THPs).
> >>>
> >>> The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
> >>> collapsing, so we don't need "cc->is_khugepaged" here.
> >>>
> >>> We can imagine that if a user simultaneously uses MADV_FREE and
> >>> MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
> >>> As the kernel, we need to safeguard the baseline.
> >> No. Afraid I don't think so.
> >>
> >> To be clear, what I meant by "enforce":
> >>
> >> Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
> >> call MADV_COLLAPSE, they're explicitly asking for collapse.
> >>
> >> Compared to khugepaged just scanning around, that's already "enforce"
> >> - users are actively requesting it, not passively waiting for.
> >>
> >> Note that you're *breaking* userspace. Users would not be able
> >> to collapse the range where there are any lazyfree pages anymore,
> >> even when they explicitly call MADV_COLLAPSE.
> >>
> >> For khugepaged, skipping lazyfree makes sense.
> > I got your meaning, this is equivalent to two questions:
> >
> > 1. Does the semantics of best-effort imply any "enforce" meaning?
> > 2. When madvise(MADV_FREE| MADV_COLLAPSE), do we want to collapse
> >    lazyfree folios?
> >
> > This is a semantic warning, and I'd like to hear others' opinions.
>

That said, it does feel a bit unfortunate. I was wondering whether we
want to give users a hint in this case, e.g. via something like:

pr_warn("Attempt to enforce hugepage collapse on lazyfree memory\n");

But I'm not sure whether this is actually worth a printk, or if it would
just add noise without providing actionable value.

> Regarding "best-effort", it is best-effort in the sense that, the
> madvise(MADV_COLLAPSE) is a syscall needed not for correctness,
> but for optimization purposes. So it is not the end of the world
> if the syscall fails. But, since the user has decided to do an
> expensive operation (syscall), kernel needs to try harder to
> make sure those CPU cycles weren't a waste.
>

Thanks
Barry