[PATCH mm-new v4 5/6] mm: khugepaged: skip lazy-free folios at scanning

Vernon Yang posted 6 patches 4 weeks, 1 day ago
There is a newer version of this series
[PATCH mm-new v4 5/6] mm: khugepaged: skip lazy-free folios at scanning
Posted by Vernon Yang 4 weeks, 1 day ago
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). However,
khugepaged still scans the cold task first and only scans the hot2 task
after completing the scan of the cold task.

So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it,
thereby avoiding unnecessary scan and collapse operations and reducing
wasted CPU time.
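
As a rough illustration of the cold task described above, the following
userspace sketch maps anonymous memory, touches it once, and then marks it
lazy-free. The helper name cold_task_setup() and its advised out-parameter
are hypothetical and are not taken from the test program used for the
measurements; the fallback MADV_FREE value is the Linux UAPI constant and is
only an assumption for older libc headers.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* Linux UAPI value; assumed for old libc headers */
#endif

#define COLD_REGION_SIZE (128UL << 20)	/* 128 MB, as in the description */

/*
 * Hypothetical helper modelling the "cold" task: map anonymous memory,
 * touch it once, then mark it lazy-free. Returns the mapping or NULL if
 * mmap() failed; *advised is set to 1 if madvise() succeeded, else 0.
 */
static void *cold_task_setup(size_t len, int *advised)
{
	void *buf;

	assert(advised != NULL);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* Brief access: fault in every page of the region. */
	memset(buf, 0x5a, len);

	/* Hint that the contents may be discarded under memory pressure. */
	*advised = (madvise(buf, len, MADV_FREE) == 0);
	return buf;
}
```

A caller would pass COLD_REGION_SIZE and keep the mapping alive while
khugepaged runs. If madvise() fails, the region simply stays ordinary
anonymous memory.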

Here are the performance test results:
(For Throughput, higher is better; for the other metrics, lower is better)

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total access time   |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total access time   |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
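
The delta column appears to be the relative change of the patched value
against the unpatched one, (with - without) / without. That formula is
inferred from the numbers and is not stated in the posting. A minimal sketch
that recomputes the x86_64 rows, with hypothetical function names:

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `with` versus `without`, in percent. */
static double delta_percent(double without, double with)
{
	return (with - without) / without * 100.0;
}

/* Recompute the delta column; inputs are copied from the x86_64 table. */
static void print_x86_64_deltas(void)
{
	printf("total access time: %+.2f%%\n", delta_percent(3.14, 2.93));
	printf("cycles per access: %+.2f%%\n", delta_percent(4.96, 2.21));
	printf("Throughput:        %+.2f%%\n", delta_percent(104.38, 111.89));
	printf("dTLB-load-misses:  %+.2f%%\n",
	       delta_percent(284814532.0, 69597236.0));
}
```

Calling print_x86_64_deltas() from main() prints one recomputed delta per
table row for comparison with the posted values.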

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 3d1069c3f0c5..e3856f8ab9eb 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
 	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
 	EM( SCAN_PAGE_LOCK,		"page_locked")			\
 	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
+	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
 	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
 	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
 	EM( SCAN_VMA_NULL,		"vma_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6df2857d94c6..8a7008760566 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -1258,6 +1259,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
+	int lazyfree = 0;
 	enum scan_result result = SCAN_FAIL;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
@@ -1343,6 +1345,21 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);
 
+		if (cc->is_khugepaged && !pte_dirty(pteval) &&
+		    folio_is_lazyfree(folio)) {
+			++lazyfree;
+
+			/*
+			 * The lazyfree folios are reclaimed and become pte_none.
+			 * Ensure they do not continue to be collapsed when
+			 * skipped ahead.
+			 */
+			if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
+				result = SCAN_PAGE_LAZYFREE;
+				goto out_unmap;
+			}
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
-- 
2.51.0
Re: [PATCH mm-new v4 5/6] mm: khugepaged: skip lazy-free folios at scanning
Posted by David Hildenbrand (Red Hat) 3 weeks, 5 days ago
On 1/11/26 13:19, Vernon Yang wrote:
> @@ -1258,6 +1259,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
>   	int none_or_zero = 0, shared = 0, referenced = 0;
> +	int lazyfree = 0;
>   	enum scan_result result = SCAN_FAIL;
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
> @@ -1343,6 +1345,21 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>   		}
>   		folio = page_folio(page);
>   
> +		if (cc->is_khugepaged && !pte_dirty(pteval) &&
> +		    folio_is_lazyfree(folio)) {
> +			++lazyfree;
> +
> +			/*
> +			 * The lazyfree folios are reclaimed and become pte_none.
> +			 * Ensure they do not continue to be collapsed when
> +			 * skipped ahead.
> +			 */
> +			if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
> +				result = SCAN_PAGE_LAZYFREE;
> +				goto out_unmap;

I dislike adding another khugepaged_max_ptes_none check. Gah.


Can't we just keep it simple and do

if (!pte_dirty(pteval) && folio_is_lazyfree(folio)) {
	result = SCAN_PAGE_LAZYFREE;
	goto out_unmap;
}

Reasoning: once they are none, we have a zero-filled page that e.g., the 
deferred shrinker can reclaim.

If you collapse with a lazyfree page, that content will never be none 
and the deferred shrinker cannot reclaim them.

So there is a real difference between them being none and them still 
being around.
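
To make the difference between the two checks concrete, here is a toy
userspace model, not kernel code. struct toy_pte and both function names are
made up for illustration. The model ignores cc->is_khugepaged and every other
scan condition, and assumes 512 PTEs per PMD as on x86_64 with 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>

#define PTES_PER_PMD 512	/* assumed: x86_64 with 4 KiB base pages */

/* Toy stand-in for one PTE slot; not a kernel type. */
struct toy_pte {
	bool none;	/* models pte_none() or a zero page */
	bool dirty;	/* models pte_dirty() */
	bool lazyfree;	/* models folio_is_lazyfree() */
};

/* v4 patch: abort only once lazyfree + none entries exceed the limit. */
static bool v4_scan_allows_collapse(const struct toy_pte *ptes,
				    int max_ptes_none)
{
	int none_or_zero = 0, lazyfree = 0;

	for (int i = 0; i < PTES_PER_PMD; i++) {
		if (ptes[i].none) {
			if (++none_or_zero > max_ptes_none)
				return false;
			continue;
		}
		if (!ptes[i].dirty && ptes[i].lazyfree) {
			++lazyfree;
			if (lazyfree + none_or_zero > max_ptes_none)
				return false;
		}
	}
	return true;
}

/* Suggested simplification: any clean lazyfree entry aborts the scan. */
static bool simple_scan_allows_collapse(const struct toy_pte *ptes,
					int max_ptes_none)
{
	int none_or_zero = 0;

	for (int i = 0; i < PTES_PER_PMD; i++) {
		if (ptes[i].none) {
			if (++none_or_zero > max_ptes_none)
				return false;
			continue;
		}
		if (!ptes[i].dirty && ptes[i].lazyfree)
			return false;
	}
	return true;
}
```

With a limit of 511 and a single clean lazyfree entry among otherwise
ordinary pages, the v4 variant still lets the range be collapsed, while the
simplified variant bails out. Once that entry is dirty, both variants treat
it as a normal page.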


We could also try turning them into none entries here, that is, test if 
we can discard them, and then just treat them like none entries.


Why don't we want to similarly handle this in 
__collapse_huge_page_isolate() ?

-- 
Cheers

David
Re: [PATCH mm-new v4 5/6] mm: khugepaged: skip lazy-free folios at scanning
Posted by Vernon Yang 3 weeks, 3 days ago
On Wed, Jan 14, 2026 at 7:50 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> On 1/11/26 13:19, Vernon Yang wrote:
> > @@ -1258,6 +1259,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> >       int none_or_zero = 0, shared = 0, referenced = 0;
> > +     int lazyfree = 0;
> >       enum scan_result result = SCAN_FAIL;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> > @@ -1343,6 +1345,21 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >               }
> >               folio = page_folio(page);
> >
> > +             if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > +                 folio_is_lazyfree(folio)) {
> > +                     ++lazyfree;
> > +
> > +                     /*
> > +                      * The lazyfree folios are reclaimed and become pte_none.
> > +                      * Ensure they do not continue to be collapsed when
> > +                      * skipped ahead.
> > +                      */
> > +                     if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
> > +                             result = SCAN_PAGE_LAZYFREE;
> > +                             goto out_unmap;
>
> I dislike adding another khugepaged_max_ptes_none check. Gah.
>
>
> Can't we just keep it simple and do
>
> if (!pte_dirty(pteval) && folio_is_lazyfree(folio)) {
>         result = SCAN_PAGE_LAZYFREE;
>         goto out_unmap;
> }

LGTM, I will apply it in the next version. Thank you for the review and
suggestions.

> Reasoning: once they are none, we have a zero-filled page that e.g., the
> deferred shrinker can reclaim.
>
> If you collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.

Nice! Thank you for the explanation.

> So there is a real difference between them being none and them still
> being around.
>
>
> We could also try turning them into none entries here, that is, test if
> we can discard them, and then just treat them like none entries.
>
>
> Why don't we want to similarly handle this in
> __collapse_huge_page_isolate() ?

It needs the same handling there. Sorry, I missed it. I will fix it in
the next version.

--
Thanks,
Vernon
Re: [PATCH mm-new v4 5/6] mm: khugepaged: skip lazy-free folios at scanning
Posted by Lance Yang 3 weeks, 5 days ago

On 2026/1/14 19:50, David Hildenbrand (Red Hat) wrote:
> On 1/11/26 13:19, Vernon Yang wrote:
>> @@ -1258,6 +1259,7 @@ static enum scan_result 
>> hpage_collapse_scan_pmd(struct mm_struct *mm,
>>       pmd_t *pmd;
>>       pte_t *pte, *_pte;
>>       int none_or_zero = 0, shared = 0, referenced = 0;
>> +    int lazyfree = 0;
>>       enum scan_result result = SCAN_FAIL;
>>       struct page *page = NULL;
>>       struct folio *folio = NULL;
>> @@ -1343,6 +1345,21 @@ static enum scan_result 
>> hpage_collapse_scan_pmd(struct mm_struct *mm,
>>           }
>>           folio = page_folio(page);
>> +        if (cc->is_khugepaged && !pte_dirty(pteval) &&
>> +            folio_is_lazyfree(folio)) {
>> +            ++lazyfree;
>> +
>> +            /*
>> +             * The lazyfree folios are reclaimed and become pte_none.
>> +             * Ensure they do not continue to be collapsed when
>> +             * skipped ahead.
>> +             */
>> +            if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
>> +                result = SCAN_PAGE_LAZYFREE;
>> +                goto out_unmap;
> 
> I dislike adding another khugepaged_max_ptes_none check. Gah.
> 
> 
> Can't we just keep it simple and do
> 
> if (!pte_dirty(pteval) && folio_is_lazyfree(folio)) {
>      result = SCAN_PAGE_LAZYFREE;
>      goto out_unmap;
> }
> 
> Reasoning: once they are none, we have a zero-filled page that e.g., the 
> deferred shrinker can reclaim.
> 
> If you collapse with a lazyfree page, that content will never be none 
> and the deferred shrinker cannot reclaim them.
> 
> So there is a real difference between them being none and them still 
> being around.
> 
> 
> We could also try turning them into none entries here, that is, test if 
> we can discard them, and then just treat them like none entries.

Right, I would prefer turning them into none entries, but that seems to
complicate things a bit, e.g., making sure we don't copy content from them
during collapse ...

So let's keep it simple: just bail out if the page is lazyfree and clean :)

> 
> 
> Why don't we want to similarly handle this in 
> __collapse_huge_page_isolate() ?

Yeah, that should be added there as well.