[v3] Improve khugepaged scan logic

[PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Vernon Yang 1 month ago

For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1 and hot2
tasks continuously access their 128 MB of memory, while the cold task
only accesses its memory briefly and then calls madvise(MADV_FREE).
However, khugepaged still scans the cold task first and only scans the
hot2 task after completing the scan of the cold task.
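
The cold task in this scenario could look roughly like the userspace
sketch below. It is only an illustration, not the test program used for
the numbers in this patch: the helper name cold_task_setup() is
hypothetical, and the THP configuration needed for khugepaged to scan
the range is not shown.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE (128UL << 20)	/* 128 MB, as in the scenario above */

/*
 * Hypothetical helper: map an anonymous region, touch it once, then
 * mark it lazy-free. Returns the mapping, or NULL on failure.
 */
void *cold_task_setup(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* Brief access: write to every page once so it is faulted in. */
	memset(buf, 0x5a, len);

	/* Tell the kernel the contents may be dropped under memory pressure. */
	if (madvise(buf, len, MADV_FREE) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

A cold task would call cold_task_setup(REGION_SIZE) once and then stay
idle, while the hot tasks keep reading and writing their own mappings.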

So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it,
thereby avoiding unnecessary scan and collapse operations and reducing
wasted CPU time.

Here are the performance test results:
(Higher is better for throughput; lower is better for the other metrics)

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total access time   |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total access time   |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/trace/events/huge_memory.h | 1 +
 mm/khugepaged.c                    | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 01225dd27ad5..e99d5f71f2a4 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
 	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
 	EM( SCAN_PAGE_LOCK,		"page_locked")			\
 	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
+	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
 	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
 	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
 	EM( SCAN_VMA_NULL,		"vma_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 30786c706c4a..1ca034a5f653 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -45,6 +45,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);
 
+		if (folio_is_lazyfree(folio)) {
+			result = SCAN_PAGE_LAZYFREE;
+			goto out_unmap;
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
-- 
2.51.0

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Lance Yang 1 month ago


On 2026/1/4 13:41, Vernon Yang wrote:
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
> 
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
> 
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
> 
> Testing on x86_64 machine:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> | cycles per access   |  4.96         |  2.21         | -55.44% |
> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> 
> Testing on qemu-system-x86_64 -enable-kvm:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> | cycles per access   |  7.29         |  2.07         | -71.60% |
> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   include/trace/events/huge_memory.h | 1 +
>   mm/khugepaged.c                    | 6 ++++++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 01225dd27ad5..e99d5f71f2a4 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
>   	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
>   	EM( SCAN_PAGE_LOCK,		"page_locked")			\
>   	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
> +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
>   	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
>   	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
>   	EM( SCAN_VMA_NULL,		"vma_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 30786c706c4a..1ca034a5f653 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -45,6 +45,7 @@ enum scan_result {
>   	SCAN_PAGE_LRU,
>   	SCAN_PAGE_LOCK,
>   	SCAN_PAGE_ANON,
> +	SCAN_PAGE_LAZYFREE,
>   	SCAN_PAGE_COMPOUND,
>   	SCAN_ANY_PROCESS,
>   	SCAN_VMA_NULL,
> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   		}
>   		folio = page_folio(page);
>   
> +		if (folio_is_lazyfree(folio)) {
> +			result = SCAN_PAGE_LAZYFREE;
> +			goto out_unmap;
> +		}

That's a bit tricky ... I don't think we need to handle MADV_FREE pages
differently :)

MADV_FREE pages are likely cold memory, but what if there are just
a few MADV_FREE pages in a hot memory region? Skipping the entire
region would be unfortunate ...

Also, even if we skip these pages now, after they are reclaimed, they
become pte_none. Then khugepaged will try to collapse them anyway
(based on khugepaged_max_ptes_none). So skipping them just delays
things, it does not really change the final result ;)

Thanks,
Lance

> +
>   		if (!folio_test_anon(folio)) {
>   			result = SCAN_PAGE_ANON;
>   			goto out_unmap;

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Vernon Yang 1 month ago

On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
>
>
> On 2026/1/4 13:41, Vernon Yang wrote:
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   include/trace/events/huge_memory.h | 1 +
> >   mm/khugepaged.c                    | 6 ++++++
> >   2 files changed, 7 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 01225dd27ad5..e99d5f71f2a4 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> >   	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
> >   	EM( SCAN_PAGE_LOCK,		"page_locked")			\
> >   	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
> > +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
> >   	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
> >   	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
> >   	EM( SCAN_VMA_NULL,		"vma_null")			\
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 30786c706c4a..1ca034a5f653 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -45,6 +45,7 @@ enum scan_result {
> >   	SCAN_PAGE_LRU,
> >   	SCAN_PAGE_LOCK,
> >   	SCAN_PAGE_ANON,
> > +	SCAN_PAGE_LAZYFREE,
> >   	SCAN_PAGE_COMPOUND,
> >   	SCAN_ANY_PROCESS,
> >   	SCAN_VMA_NULL,
> > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >   		}
> >   		folio = page_folio(page);
> > +		if (folio_is_lazyfree(folio)) {
> > +			result = SCAN_PAGE_LAZYFREE;
> > +			goto out_unmap;
> > +		}
>
> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> differently :)
>
> MADV_FREE pages are likely cold memory, but what if there are just
> a few MADV_FREE pages in a hot memory region? Skipping the entire
> region would be unfortunate ...

If some of the lazyfree folios are hot, they will be marked non-lazyfree
in the memory reclaim path, so they are not skipped in khugepaged's
next scan.

shrink_folio_list()
  try_to_unmap()
    folio_set_swapbacked()
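
The state change in that call chain can be pictured with the toy model
below. It is a simplified sketch, not kernel code: struct toy_folio and
the toy_*() helpers are hypothetical, "lazy-free" is assumed to mean
"anonymous and not swap-backed", and the real reclaim path checks more
than a single dirty bit.

```c
#include <stdbool.h>

/* Toy stand-in for the folio state that matters here. */
struct toy_folio {
	bool anon;		/* anonymous memory */
	bool swapbacked;	/* cleared by MADV_FREE */
	bool dirty;		/* written to after MADV_FREE */
};

/* Assumed definition: lazy-free means anonymous and not swap-backed. */
bool toy_is_lazyfree(const struct toy_folio *f)
{
	return f->anon && !f->swapbacked;
}

/* MADV_FREE: the folio becomes clean and lazy-free. */
void toy_madv_free(struct toy_folio *f)
{
	f->swapbacked = false;
	f->dirty = false;
}

/*
 * Reclaim attempt. Returns true if the folio was discarded.
 * A re-dirtied folio is kept and marked swap-backed again, so it is
 * no longer lazy-free afterwards.
 */
bool toy_reclaim(struct toy_folio *f)
{
	if (!toy_is_lazyfree(f))
		return false;
	if (f->dirty) {
		f->swapbacked = true;
		return false;
	}
	return true;
}
```

Note that in this model the folio only stops being lazy-free once
reclaim actually visits it; a write alone does not flip the flag.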

If none of the lazyfree folios are hot, continuing the collapse would
waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
Additionally, the collapsed hugepage becomes non-lazyfree, which prevents
the rapid release of lazyfree folios in the memory reclaim path.

So skipping lazy-free folios makes sense here.

If I missed something, please let me know, thanks!

> Also, even if we skip these pages now, after they are reclaimed, they
> become pte_none. Then khugepaged will try to collapse them anyway
> (based on khugepaged_max_ptes_none). So skipping them just delays
> things, it does not really change the final result ;)

This patch only addresses the hot1 -> cold -> hot2 scenario.

--
Thanks,
Vernon

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Lance Yang 1 month ago


On 2026/1/5 09:48, Vernon Yang wrote:
> On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
>>
>>
>> On 2026/1/4 13:41, Vernon Yang wrote:
>>> For example, create three task: hot1 -> cold -> hot2. After all three
>>> task are created, each allocate memory 128MB. the hot1/hot2 task
>>> continuously access 128 MB memory, while the cold task only accesses
>>> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
>>> still prioritizes scanning the cold task and only scans the hot2 task
>>> after completing the scan of the cold task.
>>>
>>> So if the user has explicitly informed us via MADV_FREE that this memory
>>> will be freed, it is appropriate for khugepaged to skip it only, thereby
>>> avoiding unnecessary scan and collapse operations to reducing CPU
>>> wastage.
>>>
>>> Here are the performance test results:
>>> (Throughput bigger is better, other smaller is better)
>>>
>>> Testing on x86_64 machine:
>>>
>>> | task hot2           | without patch | with patch    |  delta  |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
>>> | cycles per access   |  4.96         |  2.21         | -55.44% |
>>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
>>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
>>>
>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>
>>> | task hot2           | without patch | with patch    |  delta  |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
>>> | cycles per access   |  7.29         |  2.07         | -71.60% |
>>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
>>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>>    include/trace/events/huge_memory.h | 1 +
>>>    mm/khugepaged.c                    | 6 ++++++
>>>    2 files changed, 7 insertions(+)
>>>
>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>> index 01225dd27ad5..e99d5f71f2a4 100644
>>> --- a/include/trace/events/huge_memory.h
>>> +++ b/include/trace/events/huge_memory.h
>>> @@ -25,6 +25,7 @@
>>>    	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
>>>    	EM( SCAN_PAGE_LOCK,		"page_locked")			\
>>>    	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
>>> +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
>>>    	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
>>>    	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
>>>    	EM( SCAN_VMA_NULL,		"vma_null")			\
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 30786c706c4a..1ca034a5f653 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -45,6 +45,7 @@ enum scan_result {
>>>    	SCAN_PAGE_LRU,
>>>    	SCAN_PAGE_LOCK,
>>>    	SCAN_PAGE_ANON,
>>> +	SCAN_PAGE_LAZYFREE,
>>>    	SCAN_PAGE_COMPOUND,
>>>    	SCAN_ANY_PROCESS,
>>>    	SCAN_VMA_NULL,
>>> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>    		}
>>>    		folio = page_folio(page);
>>> +		if (folio_is_lazyfree(folio)) {
>>> +			result = SCAN_PAGE_LAZYFREE;
>>> +			goto out_unmap;
>>> +		}
>>
>> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
>> differently :)
>>
>> MADV_FREE pages are likely cold memory, but what if there are just
>> a few MADV_FREE pages in a hot memory region? Skipping the entire
>> region would be unfortunate ...
> 
> If there are hot in lazyfree folios, the folio will be set as non-lazyfree
> in the memory reclaim path, it is not skipped in the next scan in the
> khugepaged.
> 
> shrink_folio_list()
>    try_to_unmap()
>      folio_set_swapbacked()
> 
> If there are no hot in lazyfree folios, continuing the collapse would
> waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> Additionally, due to collapse hugepage become non-lazyfree, preventing
> the rapid release of lazyfree folios in the memory reclaim path.
> 
> So skipping lazy-free folios make sense here for us.
> 
> If I missed something, please let me know, thank!

I'm not saying lazyfree pages become hot :)

If a PMD region has mostly hot pages but just a few lazyfree
pages, we would skip the entire region. Those hot pages won't
be collapsed.

> 
>> Also, even if we skip these pages now, after they are reclaimed, they
>> become pte_none. Then khugepaged will try to collapse them anyway
>> (based on khugepaged_max_ptes_none). So skipping them just delays
>> things, it does not really change the final result ;)
> 
> This patch just resolve scene for hot1 -> cold -> hot2.
> 
> --
> Thanks,
> Vernon

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Vernon Yang 1 month ago

On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/5 09:48, Vernon Yang wrote:
> > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> >>
> >>
> >> On 2026/1/4 13:41, Vernon Yang wrote:
> >>> For example, create three task: hot1 -> cold -> hot2. After all three
> >>> task are created, each allocate memory 128MB. the hot1/hot2 task
> >>> continuously access 128 MB memory, while the cold task only accesses
> >>> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> >>> still prioritizes scanning the cold task and only scans the hot2 task
> >>> after completing the scan of the cold task.
> >>>
> >>> So if the user has explicitly informed us via MADV_FREE that this memory
> >>> will be freed, it is appropriate for khugepaged to skip it only, thereby
> >>> avoiding unnecessary scan and collapse operations to reducing CPU
> >>> wastage.
> >>>
> >>> Here are the performance test results:
> >>> (Throughput bigger is better, other smaller is better)
> >>>
> >>> Testing on x86_64 machine:
> >>>
> >>> | task hot2           | without patch | with patch    |  delta  |
> >>> |---------------------|---------------|---------------|---------|
> >>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> >>> | cycles per access   |  4.96         |  2.21         | -55.44% |
> >>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> >>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >>>
> >>> Testing on qemu-system-x86_64 -enable-kvm:
> >>>
> >>> | task hot2           | without patch | with patch    |  delta  |
> >>> |---------------------|---------------|---------------|---------|
> >>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> >>> | cycles per access   |  7.29         |  2.07         | -71.60% |
> >>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> >>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >>>
> >>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >>> ---
> >>>    include/trace/events/huge_memory.h | 1 +
> >>>    mm/khugepaged.c                    | 6 ++++++
> >>>    2 files changed, 7 insertions(+)
> >>>
> >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> >>> index 01225dd27ad5..e99d5f71f2a4 100644
> >>> --- a/include/trace/events/huge_memory.h
> >>> +++ b/include/trace/events/huge_memory.h
> >>> @@ -25,6 +25,7 @@
> >>>     EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> >>>     EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> >>>     EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> >>> +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> >>>     EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> >>>     EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> >>>     EM( SCAN_VMA_NULL,              "vma_null")                     \
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 30786c706c4a..1ca034a5f653 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -45,6 +45,7 @@ enum scan_result {
> >>>     SCAN_PAGE_LRU,
> >>>     SCAN_PAGE_LOCK,
> >>>     SCAN_PAGE_ANON,
> >>> +   SCAN_PAGE_LAZYFREE,
> >>>     SCAN_PAGE_COMPOUND,
> >>>     SCAN_ANY_PROCESS,
> >>>     SCAN_VMA_NULL,
> >>> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>             }
> >>>             folio = page_folio(page);
> >>> +           if (folio_is_lazyfree(folio)) {
> >>> +                   result = SCAN_PAGE_LAZYFREE;
> >>> +                   goto out_unmap;
> >>> +           }
> >>
> >> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> >> differently :)
> >>
> >> MADV_FREE pages are likely cold memory, but what if there are just
> >> a few MADV_FREE pages in a hot memory region? Skipping the entire
> >> region would be unfortunate ...
> >
> > If there are hot in lazyfree folios, the folio will be set as non-lazyfree
> > in the memory reclaim path, it is not skipped in the next scan in the
> > khugepaged.
> >
> > shrink_folio_list()
> >    try_to_unmap()
> >      folio_set_swapbacked()
> >
> > If there are no hot in lazyfree folios, continuing the collapse would
> > waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > Additionally, due to collapse hugepage become non-lazyfree, preventing
> > the rapid release of lazyfree folios in the memory reclaim path.
> >
> > So skipping lazy-free folios make sense here for us.
> >
> > If I missed something, please let me know, thank!
>
> I'm not saying lazyfree pages become hot :)
>
> If a PMD region has mostly hot pages but just a few lazyfree
> pages, we would skip the entire region. Those hot pages won't
> be collapsed.

Same as above: the lazyfree folios will be marked non-lazyfree
in the memory reclaim path, so they are not skipped in the next scan
and the PMD region will be collapsed :)

> >
> >> Also, even if we skip these pages now, after they are reclaimed, they
> >> become pte_none. Then khugepaged will try to collapse them anyway
> >> (based on khugepaged_max_ptes_none). So skipping them just delays
> >> things, it does not really change the final result ;)
> >
> > This patch just resolve scene for hot1 -> cold -> hot2.
> >
> > --
> > Thanks,
> > Vernon
>

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Lance Yang 1 month ago


On 2026/1/5 11:12, Vernon Yang wrote:
> On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> On 2026/1/5 09:48, Vernon Yang wrote:
>>> On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
>>>>
>>>>
>>>> On 2026/1/4 13:41, Vernon Yang wrote:
>>>>> For example, create three task: hot1 -> cold -> hot2. After all three
>>>>> task are created, each allocate memory 128MB. the hot1/hot2 task
>>>>> continuously access 128 MB memory, while the cold task only accesses
>>>>> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
>>>>> still prioritizes scanning the cold task and only scans the hot2 task
>>>>> after completing the scan of the cold task.
>>>>>
>>>>> So if the user has explicitly informed us via MADV_FREE that this memory
>>>>> will be freed, it is appropriate for khugepaged to skip it only, thereby
>>>>> avoiding unnecessary scan and collapse operations to reducing CPU
>>>>> wastage.
>>>>>
>>>>> Here are the performance test results:
>>>>> (Throughput bigger is better, other smaller is better)
>>>>>
>>>>> Testing on x86_64 machine:
>>>>>
>>>>> | task hot2           | without patch | with patch    |  delta  |
>>>>> |---------------------|---------------|---------------|---------|
>>>>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
>>>>> | cycles per access   |  4.96         |  2.21         | -55.44% |
>>>>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
>>>>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
>>>>>
>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>
>>>>> | task hot2           | without patch | with patch    |  delta  |
>>>>> |---------------------|---------------|---------------|---------|
>>>>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
>>>>> | cycles per access   |  7.29         |  2.07         | -71.60% |
>>>>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
>>>>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
>>>>>
>>>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>>>> ---
>>>>>     include/trace/events/huge_memory.h | 1 +
>>>>>     mm/khugepaged.c                    | 6 ++++++
>>>>>     2 files changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>>>> index 01225dd27ad5..e99d5f71f2a4 100644
>>>>> --- a/include/trace/events/huge_memory.h
>>>>> +++ b/include/trace/events/huge_memory.h
>>>>> @@ -25,6 +25,7 @@
>>>>>      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
>>>>>      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
>>>>>      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
>>>>> +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
>>>>>      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
>>>>>      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
>>>>>      EM( SCAN_VMA_NULL,              "vma_null")                     \
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 30786c706c4a..1ca034a5f653 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -45,6 +45,7 @@ enum scan_result {
>>>>>      SCAN_PAGE_LRU,
>>>>>      SCAN_PAGE_LOCK,
>>>>>      SCAN_PAGE_ANON,
>>>>> +   SCAN_PAGE_LAZYFREE,
>>>>>      SCAN_PAGE_COMPOUND,
>>>>>      SCAN_ANY_PROCESS,
>>>>>      SCAN_VMA_NULL,
>>>>> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>              }
>>>>>              folio = page_folio(page);
>>>>> +           if (folio_is_lazyfree(folio)) {
>>>>> +                   result = SCAN_PAGE_LAZYFREE;
>>>>> +                   goto out_unmap;
>>>>> +           }
>>>>
>>>> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
>>>> differently :)
>>>>
>>>> MADV_FREE pages are likely cold memory, but what if there are just
>>>> a few MADV_FREE pages in a hot memory region? Skipping the entire
>>>> region would be unfortunate ...
>>>
>>> If there are hot in lazyfree folios, the folio will be set as non-lazyfree
>>> in the memory reclaim path, it is not skipped in the next scan in the
>>> khugepaged.
>>>
>>> shrink_folio_list()
>>>     try_to_unmap()
>>>       folio_set_swapbacked()
>>>
>>> If there are no hot in lazyfree folios, continuing the collapse would
>>> waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
>>> Additionally, due to collapse hugepage become non-lazyfree, preventing
>>> the rapid release of lazyfree folios in the memory reclaim path.
>>>
>>> So skipping lazy-free folios make sense here for us.
>>>
>>> If I missed something, please let me know, thank!
>>
>> I'm not saying lazyfree pages become hot :)
>>
>> If a PMD region has mostly hot pages but just a few lazyfree
>> pages, we would skip the entire region. Those hot pages won't
>> be collapsed.
> 
> Same above, the lazyfree folios will be set as non-lazyfree

Nope ...

> in the memory reclaim path, it is not skipped in the next scan,
> the PMD region will collapse :)

Let me be more specific:

Assume we have a PMD region (512 pages):
- Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
- Pages 500-511: lazyfree pages (MADV_FREE'd and clean)

This patch skips the entire region when it hits page 500. So pages
0-499 can't be collapsed, even though they are hot.

I'm NOT saying lazyfree pages themselves become hot ;)

As I mentioned earlier, even if we skip these pages now, after they
are reclaimed they become pte_none. Then khugepaged will try to
collapse them anyway (based on khugepaged_max_ptes_none). So
skipping them just delays things, it does not really change the
final result ...
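
To make the 512-page example concrete, here is a small userspace model
of the scan decision. It is a sketch under simplifying assumptions, not
the logic in mm/khugepaged.c: the toy_*() names and enum toy_pte are
hypothetical, skip_lazyfree stands in for the check added by this patch,
and max_ptes_none stands in for khugepaged_max_ptes_none.

```c
#include <stdbool.h>

#define PTES_PER_PMD 512	/* pages per PMD region in the example above */

enum toy_pte { PTE_NONE, PTE_HOT, PTE_LAZYFREE };

/*
 * Toy scan of one PMD range. Returns true if the range remains a
 * collapse candidate after looking at every PTE.
 */
bool toy_scan_pmd(const enum toy_pte *ptes, bool skip_lazyfree,
		  int max_ptes_none)
{
	int none = 0;

	for (int i = 0; i < PTES_PER_PMD; i++) {
		if (ptes[i] == PTE_NONE) {
			if (++none > max_ptes_none)
				return false;
			continue;
		}
		if (skip_lazyfree && ptes[i] == PTE_LAZYFREE)
			return false;	/* whole range abandoned at first hit */
	}
	return true;
}

/* Layout from the example: pages 0-499 hot, pages 500-511 lazy-free. */
void toy_fill_example(enum toy_pte *ptes)
{
	for (int i = 0; i < PTES_PER_PMD; i++)
		ptes[i] = i < 500 ? PTE_HOT : PTE_LAZYFREE;
}

/* Model reclaim dropping every lazy-free page, leaving empty PTEs. */
void toy_reclaim_lazyfree(enum toy_pte *ptes)
{
	for (int i = 0; i < PTES_PER_PMD; i++)
		if (ptes[i] == PTE_LAZYFREE)
			ptes[i] = PTE_NONE;
}
```

In this model, with skip_lazyfree set the example range is rejected at
page 500. Once toy_reclaim_lazyfree() has run, the same range passes
again whenever max_ptes_none is at least 12, which covers the default
of 511.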

> 
>>>
>>>> Also, even if we skip these pages now, after they are reclaimed, they
>>>> become pte_none. Then khugepaged will try to collapse them anyway
>>>> (based on khugepaged_max_ptes_none). So skipping them just delays
>>>> things, it does not really change the final result ;)
>>>
>>> This patch just resolve scene for hot1 -> cold -> hot2.
>>>
>>> --
>>> Thanks,
>>> Vernon
>>

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Vernon Yang 1 month ago

On Mon, Jan 05, 2026 at 11:35:58AM +0800, Lance Yang wrote:
>
>
> On 2026/1/5 11:12, Vernon Yang wrote:
> > On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
> > >
> > > On 2026/1/5 09:48, Vernon Yang wrote:
> > > > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> > > > >
> > > > >
> > > > > On 2026/1/4 13:41, Vernon Yang wrote:
> > > > > > For example, create three task: hot1 -> cold -> hot2. After all three
> > > > > > task are created, each allocate memory 128MB. the hot1/hot2 task
> > > > > > continuously access 128 MB memory, while the cold task only accesses
> > > > > > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > > > > > still prioritizes scanning the cold task and only scans the hot2 task
> > > > > > after completing the scan of the cold task.
> > > > > >
> > > > > > So if the user has explicitly informed us via MADV_FREE that this memory
> > > > > > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > > > > > avoiding unnecessary scan and collapse operations to reducing CPU
> > > > > > wastage.
> > > > > >
> > > > > > Here are the performance test results:
> > > > > > (Throughput bigger is better, other smaller is better)
> > > > > >
> > > > > > Testing on x86_64 machine:
> > > > > >
> > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > |---------------------|---------------|---------------|---------|
> > > > > > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > > > > > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > > > > > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > > > > > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> > > > > >
> > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > >
> > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > |---------------------|---------------|---------------|---------|
> > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > > > > > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > > > > > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> > > > > >
> > > > > > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > > > > > ---
> > > > > >     include/trace/events/huge_memory.h | 1 +
> > > > > >     mm/khugepaged.c                    | 6 ++++++
> > > > > >     2 files changed, 7 insertions(+)
> > > > > >
> > > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > > index 01225dd27ad5..e99d5f71f2a4 100644
> > > > > > --- a/include/trace/events/huge_memory.h
> > > > > > +++ b/include/trace/events/huge_memory.h
> > > > > > @@ -25,6 +25,7 @@
> > > > > >      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> > > > > >      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> > > > > >      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > > > > > +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> > > > > >      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> > > > > >      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> > > > > >      EM( SCAN_VMA_NULL,              "vma_null")                     \
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 30786c706c4a..1ca034a5f653 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -45,6 +45,7 @@ enum scan_result {
> > > > > >      SCAN_PAGE_LRU,
> > > > > >      SCAN_PAGE_LOCK,
> > > > > >      SCAN_PAGE_ANON,
> > > > > > +   SCAN_PAGE_LAZYFREE,
> > > > > >      SCAN_PAGE_COMPOUND,
> > > > > >      SCAN_ANY_PROCESS,
> > > > > >      SCAN_VMA_NULL,
> > > > > > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > > > >              }
> > > > > >              folio = page_folio(page);
> > > > > > +           if (folio_is_lazyfree(folio)) {
> > > > > > +                   result = SCAN_PAGE_LAZYFREE;
> > > > > > +                   goto out_unmap;
> > > > > > +           }
> > > > >
> > > > > That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> > > > > differently :)
> > > > >
> > > > > MADV_FREE pages are likely cold memory, but what if there are just
> > > > > a few MADV_FREE pages in a hot memory region? Skipping the entire
> > > > > region would be unfortunate ...
> > > >
> > > > If there are hot in lazyfree folios, the folio will be set as non-lazyfree
> > > > in the memory reclaim path, it is not skipped in the next scan in the
> > > > khugepaged.
> > > >
> > > > shrink_folio_list()
> > > >     try_to_unmap()
> > > >       folio_set_swapbacked()
> > > >
> > > > If there are no hot in lazyfree folios, continuing the collapse would
> > > > waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > > > Additionally, due to collapse hugepage become non-lazyfree, preventing
> > > > the rapid release of lazyfree folios in the memory reclaim path.
> > > >
> > > > So skipping lazy-free folios make sense here for us.
> > > >
> > > > If I missed something, please let me know, thank!
> > >
> > > I'm not saying lazyfree pages become hot :)
> > >
> > > If a PMD region has mostly hot pages but just a few lazyfree
> > > pages, we would skip the entire region. Those hot pages won't
> > > be collapsed.
> >
> > Same above, the lazyfree folios will be set as non-lazyfree
>
> Nop ...
>
> > in the memory reclaim path, it is not skipped in the next scan,
> > the PMD region will collapse :)
>
> Let me be more specific:
>
> Assume we have a PMD region (512 pages):
> - Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
> - Pages 500-511: lazyfree pages (MADV_FREE'd and clean)
>
> This patch skips the entire region when it hits page 500. So pages
> 0-499 can't be collapsed, even though they are hot.
>
> I'm NOT saying lazyfree pages themselves become hot ;)
>
> As I mentioned earlier, even if we skip these pages now, after they
> are reclaimed they become pte_none. Then khugepaged will try to
> collapse them anyway (based on khugepaged_max_ptes_none). So
> skipping them just delays things, it does not really change the
> final result ...

I got it. Thank you for the explanation.
I have refined the code to address this issue, as follows:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 30786c706c4a..afea2e12394e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -45,6 +45,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -1256,6 +1257,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	pte_t *pte, *_pte;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
+	int lazyfree = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr;
@@ -1337,6 +1339,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);

+		if (cc->is_khugepaged && !pte_dirty(pteval) &&
+		    folio_is_lazyfree(folio)) {
+			++lazyfree;
+
+			/*
+			 * Lazyfree folios become pte_none once they are
+			 * reclaimed, so count them together with none_or_zero
+			 * so that a range skipped here would not be collapsed
+			 * after reclaim either.
+			 */
+			if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
+				result = SCAN_PAGE_LAZYFREE;
+				goto out_unmap;
+			}
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
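
To make the effect of the refined check easier to follow, here is a minimal user-space sketch of the counting logic applied to the 512-page example above. It is only a model under stated assumptions: `enum pte_kind`, `scan_refined()` and `fill_region()` are hypothetical names invented for illustration, the `cc->is_khugepaged` and `pte_dirty()` conditions are left out, and `max_ptes_none` stands in for `khugepaged_max_ptes_none` (assumed to be 511 by default with 4K pages on x86_64).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for what the scan loop sees; not kernel types. */
enum pte_kind { PTE_NONE_OR_ZERO, PTE_HOT, PTE_LAZYFREE };

/* Only the result codes this sketch needs. */
enum scan_result { SCAN_SUCCEED, SCAN_EXCEED_NONE_PTE, SCAN_PAGE_LAZYFREE };

/*
 * Simplified model of the refined check: lazyfree PTEs are counted together
 * with none/zero PTEs, and the scan only gives up once the combined count
 * exceeds max_ptes_none.
 */
static enum scan_result scan_refined(const enum pte_kind *ptes, size_t nr,
				     int max_ptes_none)
{
	int none_or_zero = 0, lazyfree = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		switch (ptes[i]) {
		case PTE_NONE_OR_ZERO:
			if (++none_or_zero > max_ptes_none)
				return SCAN_EXCEED_NONE_PTE;
			break;
		case PTE_LAZYFREE:
			++lazyfree;
			if (lazyfree + none_or_zero > max_ptes_none)
				return SCAN_PAGE_LAZYFREE;
			break;
		case PTE_HOT:
			break;
		}
	}
	return SCAN_SUCCEED;
}

/* Fill a region with nr_hot hot PTEs followed by lazyfree PTEs. */
static void fill_region(enum pte_kind *ptes, size_t nr, size_t nr_hot)
{
	size_t i;

	assert(nr_hot <= nr);
	for (i = 0; i < nr; i++)
		ptes[i] = i < nr_hot ? PTE_HOT : PTE_LAZYFREE;
}
```

In this model, a region with 500 hot pages and 12 lazyfree pages is still a collapse candidate when `max_ptes_none` is 511, while a region that is entirely lazyfree is skipped with `SCAN_PAGE_LAZYFREE`.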


If you spot any bug or have a better idea, please let me know, thanks!
Otherwise, I will send it in the next version.

--
Thanks,
Vernon

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Barry Song 1 month ago

On Tue, Jan 6, 2026 at 1:31 AM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> On Mon, Jan 05, 2026 at 11:35:58AM +0800, Lance Yang wrote:
> >
> >
> > On 2026/1/5 11:12, Vernon Yang wrote:
> > > On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
> > > >
> > > > On 2026/1/5 09:48, Vernon Yang wrote:
> > > > > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> > > > > >
> > > > > >
> > > > > > On 2026/1/4 13:41, Vernon Yang wrote:
> > > > > > > For example, create three task: hot1 -> cold -> hot2. After all three
> > > > > > > task are created, each allocate memory 128MB. the hot1/hot2 task
> > > > > > > continuously access 128 MB memory, while the cold task only accesses
> > > > > > > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > > > > > > still prioritizes scanning the cold task and only scans the hot2 task
> > > > > > > after completing the scan of the cold task.
> > > > > > >
> > > > > > > So if the user has explicitly informed us via MADV_FREE that this memory
> > > > > > > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > > > > > > avoiding unnecessary scan and collapse operations to reducing CPU
> > > > > > > wastage.
> > > > > > >
> > > > > > > Here are the performance test results:
> > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > >
> > > > > > > Testing on x86_64 machine:
> > > > > > >
> > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > > > > > > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > > > > > > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > > > > > > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> > > > > > >
> > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > >
> > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > > > > > > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > > > > > > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> > > > > > >
> > > > > > > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > > > > > > ---
> > > > > > >     include/trace/events/huge_memory.h | 1 +
> > > > > > >     mm/khugepaged.c                    | 6 ++++++
> > > > > > >     2 files changed, 7 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > > > index 01225dd27ad5..e99d5f71f2a4 100644
> > > > > > > --- a/include/trace/events/huge_memory.h
> > > > > > > +++ b/include/trace/events/huge_memory.h
> > > > > > > @@ -25,6 +25,7 @@
> > > > > > >      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> > > > > > >      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> > > > > > >      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > > > > > > +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> > > > > > >      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> > > > > > >      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> > > > > > >      EM( SCAN_VMA_NULL,              "vma_null")                     \
> > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > index 30786c706c4a..1ca034a5f653 100644
> > > > > > > --- a/mm/khugepaged.c
> > > > > > > +++ b/mm/khugepaged.c
> > > > > > > @@ -45,6 +45,7 @@ enum scan_result {
> > > > > > >      SCAN_PAGE_LRU,
> > > > > > >      SCAN_PAGE_LOCK,
> > > > > > >      SCAN_PAGE_ANON,
> > > > > > > +   SCAN_PAGE_LAZYFREE,
> > > > > > >      SCAN_PAGE_COMPOUND,
> > > > > > >      SCAN_ANY_PROCESS,
> > > > > > >      SCAN_VMA_NULL,
> > > > > > > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > > > > >              }
> > > > > > >              folio = page_folio(page);
> > > > > > > +           if (folio_is_lazyfree(folio)) {
> > > > > > > +                   result = SCAN_PAGE_LAZYFREE;
> > > > > > > +                   goto out_unmap;
> > > > > > > +           }
> > > > > >
> > > > > > That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> > > > > > differently :)
> > > > > >
> > > > > > MADV_FREE pages are likely cold memory, but what if there are just
> > > > > > a few MADV_FREE pages in a hot memory region? Skipping the entire
> > > > > > region would be unfortunate ...
> > > > >
> > > > > If there are hot in lazyfree folios, the folio will be set as non-lazyfree
> > > > > in the memory reclaim path, it is not skipped in the next scan in the
> > > > > khugepaged.
> > > > >
> > > > > shrink_folio_list()
> > > > >     try_to_unmap()
> > > > >       folio_set_swapbacked()
> > > > >
> > > > > If there are no hot in lazyfree folios, continuing the collapse would
> > > > > waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > > > > Additionally, due to collapse hugepage become non-lazyfree, preventing
> > > > > the rapid release of lazyfree folios in the memory reclaim path.
> > > > >
> > > > > So skipping lazy-free folios make sense here for us.
> > > > >
> > > > > If I missed something, please let me know, thank!
> > > >
> > > > I'm not saying lazyfree pages become hot :)
> > > >
> > > > If a PMD region has mostly hot pages but just a few lazyfree
> > > > pages, we would skip the entire region. Those hot pages won't
> > > > be collapsed.
> > >
> > > Same above, the lazyfree folios will be set as non-lazyfree
> >
> > Nop ...
> >
> > > in the memory reclaim path, it is not skipped in the next scan,
> > > the PMD region will collapse :)
> >
> > Let me be more specific:
> >
> > Assume we have a PMD region (512 pages):
> > - Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
> > - Pages 500-511: lazyfree pages (MADV_FREE'd and clean)
> >
> > This patch skips the entire region when it hits page 500. So pages
> > 0-499 can't be collapsed, even though they are hot.
> >
> > I'm NOT saying lazyfree pages themselves become hot ;)
> >
> > As I mentioned earlier, even if we skip these pages now, after they
> > are reclaimed they become pte_none. Then khugepaged will try to
> > collapse them anyway (based on khugepaged_max_ptes_none). So
> > skipping them just delays things, it does not really change the
> > final result ...
>
> I got it. Thank you for explain.
> I refine the code, it can resolve this issue, as follows:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 30786c706c4a..afea2e12394e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -45,6 +45,7 @@ enum scan_result {
>         SCAN_PAGE_LRU,
>         SCAN_PAGE_LOCK,
>         SCAN_PAGE_ANON,
> +       SCAN_PAGE_LAZYFREE,
>         SCAN_PAGE_COMPOUND,
>         SCAN_ANY_PROCESS,
>         SCAN_VMA_NULL,
> @@ -1256,6 +1257,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>         pte_t *pte, *_pte;
>         int result = SCAN_FAIL, referenced = 0;
>         int none_or_zero = 0, shared = 0;
> +       int lazyfree = 0;
>         struct page *page = NULL;
>         struct folio *folio = NULL;
>         unsigned long addr;
> @@ -1337,6 +1339,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                 }
>                 folio = page_folio(page);
>
> +               if (cc->is_khugepaged && !pte_dirty(pteval) &&
> +                   folio_is_lazyfree(folio)) {
> +                       ++lazyfree;
> +
> +                       /*
> +                        * Due to the lazyfree-folios is reclaimed become
> +                        * pte_none, make sure it doesn't continue to be
> +                        * collapsed when skip ahead.
> +                        */
> +                       if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
> +                               result = SCAN_PAGE_LAZYFREE;
> +                               goto out_unmap;
> +                       }
> +               }
> +

I am still not fully convinced that this is the correct approach. You may
want to look at jemalloc or scudo to see how userspace heaps use
MADV_FREE for small size classes. In practice, it can be quite
difficult to form a large range of PTEs that are all marked lazyfree.
From my perspective, it would make more sense not to collapse the
entire range if only part of it is lazyfree.
For example, given PTEs like the following:
    lazyfree, lazyfree, non-lazyfree, non-lazyfree

Collapsing the range is unnecessary, as the first two entries are likely
to be freed soon.
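
One possible reading of this suggestion, written as a small user-space sketch rather than kernel code: a range is left alone as soon as any of its PTEs is lazyfree. The `lazyfree[]` flag array and both function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical predicate: true if any PTE in the range is lazyfree.
 * lazyfree[i] is an illustrative flag, not a kernel data structure.
 */
static bool range_has_lazyfree(const bool *lazyfree, size_t nr)
{
	size_t i;

	assert(lazyfree != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (lazyfree[i])
			return true;
	return false;
}

/* Policy under discussion: collapse only when no PTE is lazyfree. */
static bool should_collapse(const bool *lazyfree, size_t nr)
{
	return !range_has_lazyfree(lazyfree, nr);
}
```

Under this policy the four-entry pattern above is not collapsed. The same rule would also leave a range with 500 hot pages and 12 lazyfree pages uncollapsed, which is the case raised earlier in the thread.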

>                 if (!folio_test_anon(folio)) {
>                         result = SCAN_PAGE_ANON;
>                         goto out_unmap;
>
>
> If it has anything bug or better idea, please let me know, thanks!
> If no, I will send it in the next version.
>
> --
> Thanks,
> Vernon

Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning

Posted by Vernon Yang 1 month ago

On Tue, Jan 06, 2026 at 11:33:35PM +1300, Barry Song wrote:
> On Tue, Jan 6, 2026 at 1:31 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > On Mon, Jan 05, 2026 at 11:35:58AM +0800, Lance Yang wrote:
> > >
> > >
> > > On 2026/1/5 11:12, Vernon Yang wrote:
> > > > On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
> > > > >
> > > > > On 2026/1/5 09:48, Vernon Yang wrote:
> > > > > > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2026/1/4 13:41, Vernon Yang wrote:
> > > > > > > > For example, create three task: hot1 -> cold -> hot2. After all three
> > > > > > > > task are created, each allocate memory 128MB. the hot1/hot2 task
> > > > > > > > continuously access 128 MB memory, while the cold task only accesses
> > > > > > > > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > > > > > > > still prioritizes scanning the cold task and only scans the hot2 task
> > > > > > > > after completing the scan of the cold task.
> > > > > > > >
> > > > > > > > So if the user has explicitly informed us via MADV_FREE that this memory
> > > > > > > > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > > > > > > > avoiding unnecessary scan and collapse operations to reducing CPU
> > > > > > > > wastage.
> > > > > > > >
> > > > > > > > Here are the performance test results:
> > > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > > >
> > > > > > > > Testing on x86_64 machine:
> > > > > > > >
> > > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > > > > > > > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > > > > > > > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > > > > > > > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> > > > > > > >
> > > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > > >
> > > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > > > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > > > > > > > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > > > > > > > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> > > > > > > >
> > > > > > > > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > > > > > > > ---
> > > > > > > >     include/trace/events/huge_memory.h | 1 +
> > > > > > > >     mm/khugepaged.c                    | 6 ++++++
> > > > > > > >     2 files changed, 7 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > > > > index 01225dd27ad5..e99d5f71f2a4 100644
> > > > > > > > --- a/include/trace/events/huge_memory.h
> > > > > > > > +++ b/include/trace/events/huge_memory.h
> > > > > > > > @@ -25,6 +25,7 @@
> > > > > > > >      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> > > > > > > >      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> > > > > > > >      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > > > > > > > +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> > > > > > > >      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> > > > > > > >      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> > > > > > > >      EM( SCAN_VMA_NULL,              "vma_null")                     \
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index 30786c706c4a..1ca034a5f653 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -45,6 +45,7 @@ enum scan_result {
> > > > > > > >      SCAN_PAGE_LRU,
> > > > > > > >      SCAN_PAGE_LOCK,
> > > > > > > >      SCAN_PAGE_ANON,
> > > > > > > > +   SCAN_PAGE_LAZYFREE,
> > > > > > > >      SCAN_PAGE_COMPOUND,
> > > > > > > >      SCAN_ANY_PROCESS,
> > > > > > > >      SCAN_VMA_NULL,
> > > > > > > > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > > > > > >              }
> > > > > > > >              folio = page_folio(page);
> > > > > > > > +           if (folio_is_lazyfree(folio)) {
> > > > > > > > +                   result = SCAN_PAGE_LAZYFREE;
> > > > > > > > +                   goto out_unmap;
> > > > > > > > +           }
> > > > > > >
> > > > > > > That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> > > > > > > differently :)
> > > > > > >
> > > > > > > MADV_FREE pages are likely cold memory, but what if there are just
> > > > > > > a few MADV_FREE pages in a hot memory region? Skipping the entire
> > > > > > > region would be unfortunate ...
> > > > > >
> > > > > > If there are hot in lazyfree folios, the folio will be set as non-lazyfree
> > > > > > in the memory reclaim path, it is not skipped in the next scan in the
> > > > > > khugepaged.
> > > > > >
> > > > > > shrink_folio_list()
> > > > > >     try_to_unmap()
> > > > > >       folio_set_swapbacked()
> > > > > >
> > > > > > If there are no hot in lazyfree folios, continuing the collapse would
> > > > > > waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > > > > > Additionally, due to collapse hugepage become non-lazyfree, preventing
> > > > > > the rapid release of lazyfree folios in the memory reclaim path.
> > > > > >
> > > > > > So skipping lazy-free folios make sense here for us.
> > > > > >
> > > > > > If I missed something, please let me know, thank!
> > > > >
> > > > > I'm not saying lazyfree pages become hot :)
> > > > >
> > > > > If a PMD region has mostly hot pages but just a few lazyfree
> > > > > pages, we would skip the entire region. Those hot pages won't
> > > > > be collapsed.
> > > >
> > > > Same above, the lazyfree folios will be set as non-lazyfree
> > >
> > > Nop ...
> > >
> > > > in the memory reclaim path, it is not skipped in the next scan,
> > > > the PMD region will collapse :)
> > >
> > > Let me be more specific:
> > >
> > > Assume we have a PMD region (512 pages):
> > > - Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
> > > - Pages 500-511: lazyfree pages (MADV_FREE'd and clean)
> > >
> > > This patch skips the entire region when it hits page 500. So pages
> > > 0-499 can't be collapsed, even though they are hot.
> > >
> > > I'm NOT saying lazyfree pages themselves become hot ;)
> > >
> > > As I mentioned earlier, even if we skip these pages now, after they
> > > are reclaimed they become pte_none. Then khugepaged will try to
> > > collapse them anyway (based on khugepaged_max_ptes_none). So
> > > skipping them just delays things, it does not really change the
> > > final result ...

here

> >
> > I got it. Thank you for explain.
> > I refine the code, it can resolve this issue, as follows:
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 30786c706c4a..afea2e12394e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -45,6 +45,7 @@ enum scan_result {
> >         SCAN_PAGE_LRU,
> >         SCAN_PAGE_LOCK,
> >         SCAN_PAGE_ANON,
> > +       SCAN_PAGE_LAZYFREE,
> >         SCAN_PAGE_COMPOUND,
> >         SCAN_ANY_PROCESS,
> >         SCAN_VMA_NULL,
> > @@ -1256,6 +1257,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >         pte_t *pte, *_pte;
> >         int result = SCAN_FAIL, referenced = 0;
> >         int none_or_zero = 0, shared = 0;
> > +       int lazyfree = 0;
> >         struct page *page = NULL;
> >         struct folio *folio = NULL;
> >         unsigned long addr;
> > @@ -1337,6 +1339,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >                 }
> >                 folio = page_folio(page);
> >
> > +               if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > +                   folio_is_lazyfree(folio)) {
> > +                       ++lazyfree;
> > +
> > +                       /*
> > +                        * Due to the lazyfree-folios is reclaimed become
> > +                        * pte_none, make sure it doesn't continue to be
> > +                        * collapsed when skip ahead.
> > +                        */
> > +                       if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
> > +                               result = SCAN_PAGE_LAZYFREE;
> > +                               goto out_unmap;
> > +                       }
> > +               }
> > +
>
> I am still not fully convinced that this is the correct approach. You may
> want to look at jemalloc or scudo to see how userspace heaps use
> MADV_FREE for small size classes. In practice, it can be quite
> difficult to form a large range of PTEs that are all marked lazyfree.
> From my perspective, it would make more sense not to collapse the
> entire range if only part of it is lazyfree.
> I mean:
> for ptes as below,
>     lazyfree, lazyfree, non-lazyfree, non-lazyfree
>
> Collapsing the range is unnecessary, as the first two entries are likely
> to be freed soon.

But if the latter two entries are hot and we do not collapse, the
situation Lance described (marked "here" above) may occur.

> >                 if (!folio_test_anon(folio)) {
> >                         result = SCAN_PAGE_ANON;
> >                         goto out_unmap;
> >
> >
> > If it has anything bug or better idea, please let me know, thanks!
> > If no, I will send it in the next version.
> >
> > --
> > Thanks,
> > Vernon
>

[PATCH v3 1/6] mm: khugepaged: add trace_mm_khugepaged_scan event
[PATCH v3 2/6] mm: khugepaged: refine scan progress number
[PATCH v3 3/6] mm: khugepaged: just skip when the memory has been collapsed
[PATCH v3 4/6] mm: add folio_is_lazyfree helper
[PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
[PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY