From: Vernon Yang <yanglincheng@kylinos.cn>
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.

Moreover, if we collapse a lazyfree page, its contents can no longer be
discarded, and the deferred shrinker cannot reclaim it.

So if the user has explicitly told us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it,
avoiding unnecessary scan and collapse operations and reducing CPU
waste.
Here are the performance test results
(throughput: higher is better; all other metrics: lower is better):

Testing on x86_64 machine:

| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec | 2.93 sec | -6.69% |
| cycles per access | 4.96 | 2.21 | -55.44% |
| Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
| dTLB-load-misses | 284814532 | 69597236 | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec | 2.96 sec | -11.64% |
| cycles per access | 7.29 | 2.07 | -71.60% |
| Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
| dTLB-load-misses | 241600871 | 3216108 | -98.67% |
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
---
include/trace/events/huge_memory.h | 1 +
mm/khugepaged.c | 13 +++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 384e29f6bef0..bcdc57eea270 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
EM( SCAN_PAGE_LRU, "page_not_in_lru") \
EM( SCAN_PAGE_LOCK, "page_locked") \
EM( SCAN_PAGE_ANON, "page_not_anon") \
+ EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
EM( SCAN_PAGE_COMPOUND, "page_compound") \
EM( SCAN_ANY_PROCESS, "no_process_for_page") \
EM( SCAN_VMA_NULL, "vma_null") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8b68ae3bc2c5..0d160e612e16 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
SCAN_PAGE_LRU,
SCAN_PAGE_LOCK,
SCAN_PAGE_ANON,
+ SCAN_PAGE_LAZYFREE,
SCAN_PAGE_COMPOUND,
SCAN_ANY_PROCESS,
SCAN_VMA_NULL,
@@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out;
+ }
+
/* See hpage_collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
@@ -1335,6 +1342,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
}
folio = page_folio(page);
+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out_unmap;
+ }
+
if (!folio_test_anon(folio)) {
result = SCAN_PAGE_ANON;
goto out_unmap;
--
2.51.0
On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> [...]
>
> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
We have two corner cases here:

1. Even if a lazyfree folio is dirty, if the VMA has the VM_DROPPABLE flag,
a lazyfree folio may still be dropped, even when its PTE is dirty.

2. GUP operation can cause a folio to become dirty.

I see the corner cases from try_to_unmap_one():

		if (folio_test_dirty(folio) &&
		    !(vma->vm_flags & VM_DROPPABLE)) {
			/*
			 * redirtied either using the page table or a
			 * previously obtained GUP reference.
			 */
			set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
			folio_set_swapbacked(folio);
			goto walk_abort;
		}

Should we take these two corner cases into account?
> + result = SCAN_PAGE_LAZYFREE;
> + goto out;
> + }
> [...]

Thanks
Barry
On 2026/2/7 16:34, Barry Song wrote:
> On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>>
>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>
>> [...]
>>
>> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> folio = page_folio(page);
>> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>
>> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
>> + folio_test_lazyfree(folio)) {
>
> We have two corner cases here:
Good catch!
>
> 1. Even if a lazyfree folio is dirty, if the VMA has the VM_DROPPABLE flag,
> a lazyfree folio may still be dropped, even when its PTE is dirty.
Right. When the VMA has VM_DROPPABLE, we would drop the lazyfree folio
regardless of whether it (or the PTE) is dirty in try_to_unmap_one().
So, IMHO, we could go with:
cc->is_khugepaged && folio_test_lazyfree(folio) &&
(!pte_dirty(pteval) || (vma->vm_flags & VM_DROPPABLE))
>
> 2. GUP operation can cause a folio to become dirty.
Emm... I don't think we need to do anything special for GUP here :)
IIUC, if the range is pinned, MADV_COLLAPSE/khugepaged already fails;
We hit the refcount check in hpage_collapse_scan_pmd() (expected vs
actual refcount) and return -EAGAIN.
```
	/*
	 * Check if the page has any GUP (or other external) pins.
	 *
	 * Here the check may be racy:
	 * it may see folio_mapcount() > folio_ref_count().
	 * But such case is ephemeral we could always retry collapse
	 * later. However it may report false positive if the page
	 * has excessive GUP pins (i.e. 512). Anyway the same check
	 * will be done again later the risk seems low.
	 */
	if (folio_expected_ref_count(folio) != folio_ref_count(folio)) {
		result = SCAN_PAGE_COUNT;
		goto out_unmap;
	}
```
Cheers,
Lance
>
> I see the corner cases from try_to_unmap_one():
>
> if (folio_test_dirty(folio) &&
> !(vma->vm_flags & VM_DROPPABLE)) {
> /*
> * redirtied either using the
> page table or a previously
> * obtained GUP reference.
> */
> set_ptes(mm, address,
> pvmw.pte, pteval, nr_pages);
> folio_set_swapbacked(folio);
> goto walk_abort;
> }
>
> Should we take these two corner cases into account?
>
>> [...]
>
> Thanks
> Barry
On 2/7/26 14:51, Lance Yang wrote:
>
>
> On 2026/2/7 16:34, Barry Song wrote:
>> On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>>>
>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>
>>> [...]
>>>
>>> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                  folio = page_folio(page);
>>>                  VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>>
>>> +       if (cc->is_khugepaged && !pte_dirty(pteval) &&
>>> +           folio_test_lazyfree(folio)) {
>>
>> We have two corner cases here:
>
> Good catch!
>
>>
>> 1. Even if a lazyfree folio is dirty, if the VMA has the VM_DROPPABLE flag,
>> a lazyfree folio may still be dropped, even when its PTE is dirty.
Good point!
>
> Right. When the VMA has VM_DROPPABLE, we would drop the lazyfree folio
> regardless of whether it (or the PTE) is dirty in try_to_unmap_one().
>
> So, IMHO, we could go with:
>
> cc->is_khugepaged && folio_test_lazyfree(folio) &&
> (!pte_dirty(pteval) || (vma->vm_flags & VM_DROPPABLE))
Hm. In a VM_DROPPABLE mapping all folios should be marked as lazy-free
(see folio_add_new_anon_rmap()).

The new (collapse) folio will also be marked lazy-free (due to
folio_add_new_anon_rmap()) and can just get dropped at any time.

So likely we should just not skip collapse for lazyfree folios in
VM_DROPPABLE mappings?

	if (cc->is_khugepaged && !(vma->vm_flags & VM_DROPPABLE) &&
	    folio_test_lazyfree(folio) && !pte_dirty(pteval)) {
		...
	}
--
Cheers,
David
On 2026/2/8 05:38, David Hildenbrand (Arm) wrote:
> On 2/7/26 14:51, Lance Yang wrote:
>>
>>
>> On 2026/2/7 16:34, Barry Song wrote:
>>> On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>>>>
>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>>
>>>> [...]
>>>>
>>>> +       if (cc->is_khugepaged && !pte_dirty(pteval) &&
>>>> +           folio_test_lazyfree(folio)) {
>>>
>>> We have two corner cases here:
>>
>> Good catch!
>>
>>>
>>> 1. Even if a lazyfree folio is dirty, if the VMA has the VM_DROPPABLE flag,
>>> a lazyfree folio may still be dropped, even when its PTE is dirty.
>
> Good point!
>
>>
>> Right. When the VMA has VM_DROPPABLE, we would drop the lazyfree folio
>> regardless of whether it (or the PTE) is dirty in try_to_unmap_one().
>>
>> So, IMHO, we could go with:
>>
>> cc->is_khugepaged && folio_test_lazyfree(folio) &&
>> (!pte_dirty(pteval) || (vma->vm_flags & VM_DROPPABLE))
>
> Hm. In a VM_DROPPABLE mapping all folios should be marked as lazy-free
> (see folio_add_new_anon_rmap()).
Ah, I missed that apparently :)
> The new (collapse) folio will also be marked lazy (due to
> folio_add_new_anon_rmap()) free and can just get dropped any time.
>
> So likely we should just not skip collapse for lazyfree folios in
> VM_DROPPABLE mappings?
>
> if (cc->is_khugepaged && !(vma->vm_flags & VM_DROPPABLE) &&
> folio_test_lazyfree(folio) && !pte_dirty(pteval)) {
> ...
> }
Yep. That should be doing the trick. Thanks!
On Sun, Feb 8, 2026 at 5:38 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 2/7/26 14:51, Lance Yang wrote:
> >
> >
> > On 2026/2/7 16:34, Barry Song wrote:
> >> On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@gmail.com> wrote:
> >>>
> >>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>
> >>> [...]
> >>>
> >>> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> >>> + folio_test_lazyfree(folio)) {
> >>
> >> We have two corner cases here:
> >
> > Good catch!
> >
> >>
> >> 1. Even if a lazyfree folio is dirty, if the VMA has the VM_DROPPABLE flag,
> >> a lazyfree folio may still be dropped, even when its PTE is dirty.
>
> Good point!
>
> >
> > Right. When the VMA has VM_DROPPABLE, we would drop the lazyfree folio
> > regardless of whether it (or the PTE) is dirty in try_to_unmap_one().
> >
> > So, IMHO, we could go with:
> >
> > cc->is_khugepaged && folio_test_lazyfree(folio) &&
> > (!pte_dirty(pteval) || (vma->vm_flags & VM_DROPPABLE))
>
> Hm. In a VM_DROPPABLE mapping all folios should be marked as lazy-free
> (see folio_add_new_anon_rmap()).
>
> The new (collapse) folio will also be marked lazy (due to
> folio_add_new_anon_rmap()) free and can just get dropped any time.
>
> So likely we should just not skip collapse for lazyfree folios in
> VM_DROPPABLE mappings?
Maybe change “just not skip” to “just skip”?
If the goal is to avoid the collapse overhead for folios that are
about to be dropped, we might consider skipping collapse for the
entire VMA?
>
> if (cc->is_khugepaged && !(vma->vm_flags & VM_DROPPABLE) &&
> folio_test_lazyfree(folio) && !pte_dirty(pteval)) {
> ...
> }
Thanks
Barry
On 2/7/26 23:01, Barry Song wrote:
> On Sun, Feb 8, 2026 at 5:38 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>> [...]
>> So likely we should just not skip collapse for lazyfree folios in
>> VM_DROPPABLE mappings?
>
> Maybe change "just not skip" to "just skip"?
>
> If the goal is to avoid the collapse overhead for folios that are
> about to be dropped, we might consider skipping collapse for the
> entire VMA?

If there is no memory pressure in the system, why wouldn't you just want
to collapse in a VM_DROPPABLE region?

"about to be dropped" only applies once there is actual memory pressure.
If not, these pages stick around forever.

--
Cheers,
David
On Sun, Feb 8, 2026 at 6:05 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> [...]
> If there is no memory pressure in the system, why wouldn't you just want
> to collapse in a VM_DROPPABLE region?
>
> "about to be dropped" only applies once there is actual memory pressure.
> If not, these pages stick around forever.

Agree. But this brings us back to the philosophy of the original patch.
If there is no memory pressure, lazyfree folios won't be dropped, so
collapsing them might also be reasonable.

Just collapsing fully lazyfree folios with VM_DROPPABLE while
skipping partially lazyfree VMAs seems a bit confusing to me :-)

Thanks
Barry
On 2/7/26 23:17, Barry Song wrote:
> [...]
> agree. But this brings us back to the philosophy of the original patch.
> If there is no memory pressure, lazyfree folios won't be dropped, so
> collapsing them might also be reasonable.

It's about memory pressure in the future.

> Just collapsing fully lazyfree folios with VM_DROPPABLE while
> skipping partially lazyfree VMAs seems a bit confusing to me :-)

Think of it like this:

All folios in VM_DROPPABLE are lazyfree. Collapsing maintains that
property. So you can just collapse, and memory pressure in the future
will free it up.

In contrast, collapsing in !VM_DROPPABLE does not maintain that
property. The collapsed folio will not be lazyfree, and memory pressure
in the future will not be able to free it up.

--
Cheers,
David
On Sun, Feb 8, 2026 at 6:25 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> [...]
> All folios in VM_DROPPABLE are lazyfree. Collapsing maintains that
> property. So you can just collapse and memory pressure in the future
> will free it up.
>
> In contrast, collapsing in !VM_DROPPABLE does not maintain that
> property. The collapsed folio will not be lazyfree and memory pressure
> in the future will not be able to free it up.

Thank you Barry for pointing out this corner case, and thank you David
for the suggestions and explanations.

LGTM, I will fix it in the next version.

---
Thanks,
Vernon
On Sun, Feb 8, 2026 at 6:25 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
> [...]
> All folios in VM_DROPPABLE are lazyfree. Collapsing maintains that
> property. So you can just collapse and memory pressure in the future
> will free it up.
>
> In contrast, collapsing in !VM_DROPPABLE does not maintain that
> property. The collapsed folio will not be lazyfree and memory pressure
> in the future will not be able to free it up.

Thanks for the clarification. I agree with your point: whether lazyfree
folios are carried over to the new folios changes the whole story.

Best Regards
Barry