[v1] Improve khugepaged scan logic

[PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Vernon Yang 1 month, 3 weeks ago

For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_COLD). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.

So if the user has explicitly informed us via MADV_COLD/FREE that this
memory is cold or will be freed, it is appropriate for khugepaged to
scan it only at the latest possible moment, thereby avoiding unnecessary
scan and collapse operations and reducing CPU waste.

Here are the performance test results:
(Higher is better for throughput; lower is better for the other metrics)

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.14 sec     |  2.92 sec     | -7.01%  |
| cycles per access   |  4.91         |  2.07         | -57.84% |
| Throughput          |  104.38 M/sec |  112.12 M/sec | +7.42%  |
| dTLB-load-misses    |  288966432    |  1292908      | -99.55% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.23         |  2.12         | -70.68% |
| Throughput          |  97.88 M/sec  |  110.76 M/sec | +13.16% |
| dTLB-load-misses    |  237406497    |  3189194      | -98.66% |

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/linux/khugepaged.h |  1 +
 mm/khugepaged.c            | 14 ++++++++++++++
 mm/madvise.c               |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..726e99de84e9 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern void __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern void khugepaged_enter_vma(struct vm_area_struct *vma,
 				 vm_flags_t vm_flags);
+void khugepaged_move_tail(struct mm_struct *mm);
 extern void khugepaged_min_free_kbytes_update(void);
 extern bool current_is_khugepaged(void);
 extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ec1af5be3c8..91836dda2015 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -468,6 +468,20 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 	}
 }
 
+void khugepaged_move_tail(struct mm_struct *mm)
+{
+	struct mm_slot *slot;
+
+	if (!mm_flags_test(MMF_VM_HUGEPAGE, mm))
+		return;
+
+	spin_lock(&khugepaged_mm_lock);
+	slot = mm_slot_lookup(mm_slots_hash, mm);
+	if (slot && khugepaged_scan.mm_slot != slot)
+		list_move_tail(&slot->mm_node, &khugepaged_scan.mm_head);
+	spin_unlock(&khugepaged_mm_lock);
+}
+
 void __khugepaged_exit(struct mm_struct *mm)
 {
 	struct mm_slot *slot;
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..3f9ca7af2c82 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -608,6 +608,8 @@ static long madvise_cold(struct madvise_behavior *madv_behavior)
 	madvise_cold_page_range(&tlb, madv_behavior);
 	tlb_finish_mmu(&tlb);
 
+	khugepaged_move_tail(vma->vm_mm);
+
 	return 0;
 }
 
@@ -835,6 +837,7 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
 			&walk_ops, tlb);
 	tlb_end_vma(tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
+	khugepaged_move_tail(mm);
 	return 0;
 }
 
-- 
2.51.0
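
The effect of khugepaged_move_tail() on scan order can be modelled in user space. The following is only a minimal sketch, assuming a hand-rolled circular list in place of the kernel's list_head; `struct node`, `mm_move_tail()` and `scan_order()` are hypothetical names, and locking and the mm_slot hash lookup are left out. It mirrors the patch in one respect: the entry currently being scanned is never moved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a list_head embedded in an mm_slot. */
struct node {
	struct node *prev, *next;
	const char *name;	/* task name; NULL for the list head */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
	head->name = NULL;
}

static void list_add_tail_node(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Model of the patch: unlink n and re-add it at the tail, unless n is the
 * entry the scanner is currently working on (scanning may be NULL).
 */
static void mm_move_tail(struct node *n, struct node *head,
			 const struct node *scanning)
{
	if (n == scanning)
		return;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_add_tail_node(n, head);
}

/* Write the scan order as "a,b,c" into buf (truncating if buf is small). */
static void scan_order(const struct node *head, char *buf, size_t len)
{
	assert(len > 0);
	buf[0] = '\0';
	for (const struct node *n = head->next; n != head; n = n->next) {
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, n->name, len - strlen(buf) - 1);
	}
}
```

With hot1, cold and hot2 added in that order, calling `mm_move_tail(&cold, &head, NULL)` changes the order from hot1,cold,hot2 to hot1,hot2,cold, which is the reordering the commit message describes.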

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by David Hildenbrand (Red Hat) 1 month, 3 weeks ago

On 12/15/25 10:04, Vernon Yang wrote:
> For example, create three tasks: hot1 -> cold -> hot2. After all three
> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> continuously access their 128 MB of memory, while the cold task only
> accesses its memory briefly and then calls madvise(MADV_COLD). However,
> khugepaged still prioritizes scanning the cold task and only scans the
> hot2 task after completing the scan of the cold task.
>
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> scan it only at the latest possible moment, thereby avoiding unnecessary
> scan and collapse operations and reducing CPU waste.

Again, I also don't like that because you make assumptions about a full
process based on some part of its address space.

E.g., if a library issues a MADV_COLD on some part of the memory the 
library manages, why should the remaining part of the process suffer as 
well?

This seems to be a heuristic focused on some specific workloads, no?

-- 
Cheers

David

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Vernon Yang 1 month, 3 weeks ago

On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> Again, I also don't like that because you make assumptions about a full
> process based on some part of its address space.
>
> E.g., if a library issues a MADV_COLD on some part of the memory the library
> manages, why should the remaining part of the process suffer as well?

Yes, you make a good point, thanks!

> This seems to be a heuristic focused on some specific workloads, no?

Right.

Could we use the VM_NOHUGEPAGE flag to indicate that this region should
not be collapsed, so that khugepaged can simply skip this VMA during
scanning? This way, it won't affect the remaining part of the task's
memory regions.

--
Thanks,
Vernon

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by David Hildenbrand (Red Hat) 1 month, 3 weeks ago

On 12/19/25 06:29, Vernon Yang wrote:
> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> not be collapsed, so that khugepaged can simply skip this VMA during
> scanning? This way, it won't affect the remaining part of the task's
> memory regions.

I thought we already skip these regions properly in khugepaged, or
maybe I misunderstood your question.

-- 
Cheers

David

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Wei Yang 1 month, 3 weeks ago

On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>On 12/19/25 06:29, Vernon Yang wrote:
>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>> not be collapsed, so that khugepaged can simply skip this VMA during
>> scanning? This way, it won't affect the remaining part of the task's
>> memory regions.
>
>I thought we already skip these regions properly in khugepaged, or
>maybe I misunderstood your question.
>

I think we should, but it seems we don't do this for anonymous memory in
khugepaged.

We check the vma with thp_vma_allowable_order() during scan.

  * For anonymous memory during khugepaged, if we always enable 2M collapse,
    we will scan this vma, even if VM_NOHUGEPAGE is set.

  * For other cases, it looks good since __thp_vma_allowable_orders() will
    skip this vma with vma_thp_disabled().

-- 
Wei Yang
Help you, Help me

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Vernon Yang 1 month, 3 weeks ago

On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> >I thought we already skip these regions properly in khugepaged, or
> >maybe I misunderstood your question.
> >
>
> I think we should, but it seems we don't do this for anonymous memory in
> khugepaged.
>
> We check the vma with thp_vma_allowable_order() during scan.
>
>   * For anonymous memory during khugepaged, if we always enable 2M collapse,
>     we will scan this vma, even if VM_NOHUGEPAGE is set.
>
>   * For other cases, it looks good since __thp_vma_allowable_orders() will
>     skip this vma with vma_thp_disabled().

Hi David, Wei,

khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
memory during scan, as below:

khugepaged_scan_mm_slot()
    thp_vma_allowable_order()
        thp_vma_allowable_orders()
            __thp_vma_allowable_orders()
                vma_thp_disabled() {
                     if (vm_flags & VM_NOHUGEPAGE)
                         return true;
                }

REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
vma, so khugepaged will continue to scan this vma.

I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the
test succeeded. I will send it in the next version.

--
Thanks,
Vernon

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by David Hildenbrand (Red Hat) 1 month, 2 weeks ago

On 12/21/25 05:25, Vernon Yang wrote:
> Hi David, Wei,
> 
> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> memory during scan, as below:
> 
> khugepaged_scan_mm_slot()
>      thp_vma_allowable_order()
>          thp_vma_allowable_orders()
>              __thp_vma_allowable_orders()
>                  vma_thp_disabled() {
>                       if (vm_flags & VM_NOHUGEPAGE)
>                           return true;
>                  }
> 
> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> vma, so khugepaged will continue to scan this vma.
>
> I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the
> test succeeded. I will send it in the next version.

No, we must not do that. That's a user-space visible change. :/
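
For context, user space can already opt a range out of THP explicitly with madvise(MADV_NOHUGEPAGE), which is the documented way to get VM_NOHUGEPAGE set on a VMA; the flag typically shows up as "nh" in the VmFlags line of /proc/<pid>/smaps. The following is a minimal sketch of a task that marks memory cold and then makes that opt-out itself rather than having MADV_COLD imply it. `advise_cold_nohuge()` is a hypothetical helper, and the fallback MADV_COLD define is only for older libc headers. Reading the objection above as being about this kind of observability is an interpretation, not something stated in the thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* Linux uapi value; missing from older libc headers */
#endif

/*
 * Map an anonymous region, touch it briefly, mark it cold, then explicitly
 * opt it out of THP. Returns 0 if the mapping succeeded, -1 otherwise.
 * The errno of each madvise() call (0 on success) is stored in *cold_err
 * and *nohuge_err, because either hint may be unsupported on a given kernel.
 */
static int advise_cold_nohuge(size_t len, int *cold_err, int *nohuge_err)
{
	assert(len > 0);

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	((volatile char *)p)[0] = 1;	/* brief access, like the cold task */

	/* Hint only: does not change any VMA flag. */
	*cold_err = madvise(p, len, MADV_COLD) ? errno : 0;
	/* Explicit request: this is what sets VM_NOHUGEPAGE. */
	*nohuge_err = madvise(p, len, MADV_NOHUGEPAGE) ? errno : 0;

	munmap(p, len);
	return 0;
}
```

Keeping the two calls separate leaves the decision with the application: MADV_COLD stays a reclaim hint, and THP eligibility only changes when the application asks for it.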

-- 
Cheers

David

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Vernon Yang 1 month, 2 weeks ago

On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/21/25 05:25, Vernon Yang wrote:
> > REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> > vma, so khugepaged will continue to scan this vma.
> >
> > I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the
> > test succeeded. I will send it in the next version.
>
> No, we must not do that. That's a user-space visible change. :/

David, do you have any good ideas for achieving this goal? Please let
me know, thanks!

--
Thanks,
Vernon

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by David Hildenbrand (Red Hat) 1 month, 2 weeks ago

On 12/21/25 13:34, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
>> No we must not do that. That's a user-space visible change. :/
> 
> David, what good ideas do you have to achieve this goal? let me know
> please, thank!

Your idea would be to skip a VMA when we issue madvise(MADV_COLD).

That sounds like yet another heuristic that can easily be wrong? :/

In particular, imagine if the VMA is much larger than the madvise'd 
region (other parts used for something else) or if the previously cold 
memory area is used for something that is now hot.

With memory allocators that manage most of the memory in a single large 
VMA, it's rather easy to see how such a heuristic would be bad, no?

-- 
Cheers

David

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Vernon Yang 1 month, 2 weeks ago

On Tue, Dec 23, 2025 at 10:59:29AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/21/25 13:34, Vernon Yang wrote:
> > On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
> > > On 12/21/25 05:25, Vernon Yang wrote:
> > > > On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> > > > > On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > On 12/19/25 06:29, Vernon Yang wrote:
> > > > > > > On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > > > On 12/15/25 10:04, Vernon Yang wrote:
> > > > > > > > > For example, create three task: hot1 -> cold -> hot2. After all three
> > > > > > > > > task are created, each allocate memory 128MB. the hot1/hot2 task
> > > > > > > > > continuously access 128 MB memory, while the cold task only accesses
> > > > > > > > > its memory briefly andthen call madvise(MADV_COLD). However, khugepaged
> > > > > > > > > still prioritizes scanning the cold task and only scans the hot2 task
> > > > > > > > > after completing the scan of the cold task.
> > > > > > > > >
> > > > > > > > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > > > > > > > > memory is cold or will be freed, it is appropriate for khugepaged to
> > > > > > > > > scan it only at the latest possible moment, thereby avoiding unnecessary
> > > > > > > > > scan and collapse operations to reducing CPU wastage.
> > > > > > > > >
> > > > > > > > > Here are the performance test results:
> > > > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > > > >
> > > > > > > > > Testing on x86_64 machine:
> > > > > > > > >
> > > > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > > | total accesses time |  3.14 sec     |  2.92 sec     | -7.01%  |
> > > > > > > > > | cycles per access   |  4.91         |  2.07         | -57.84% |
> > > > > > > > > | Throughput          |  104.38 M/sec |  112.12 M/sec | +7.42%  |
> > > > > > > > > | dTLB-load-misses    |  288966432    |  1292908      | -99.55% |
> > > > > > > > >
> > > > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > > > >
> > > > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > > > > | cycles per access   |  7.23         |  2.12         | -70.68% |
> > > > > > > > > | Throughput          |  97.88 M/sec  |  110.76 M/sec | +13.16% |
> > > > > > > > > | dTLB-load-misses    |  237406497    |  3189194      | -98.66% |
> > > > > > > >
> > > > > > > > Again, I also don't like that because you make assumptions on a full process
> > > > > > > > based on some part of it's address space.
> > > > > > > >
> > > > > > > > E.g., if a library issues a MADV_COLD on some part of the memory the library
> > > > > > > > manages, why should the remaining part of the process suffer as well?
> > > > > > >
> > > > > > > Yes, you make a good point, thanks!
> > > > > > >
> > > > > > > > This seems to be an heuristic focused on some specific workloads, no?
> > > > > > >
> > > > > > > Right.
> > > > > > >
> > > > > > > Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> > > > > > > not be collapsed, so that khugepaged can simply skip this VMA during
> > > > > > > scanning? This way, it won't affect the remaining part of the task's
> > > > > > > memory regions.
> > > > > >
> > > > > > I thought we would skip these regions already properly in khugeapged, or
> > > > > > maybe I misunderstood your question.
> > > > > >
> > > > >
> > > > > I think we should, but seems we didn't do this for anonymous memory during
> > > > > khugepaged.
> > > > >
> > > > > We check the vma with thp_vma_allowable_order() during scan.
> > > > >
> > > > >     * For anonymous memory during khugepaged, if we always enable 2M collapse,
> > > > >       we will scan this vma. Even VM_NOHUGEPAGE is set.
> > > > >
> > > > >     * For other cases, it looks good since __thp_vma_allowable_order() will skip
> > > > >       this vma with vma_thp_disabled().
> > > >
> > > > Hi David, Wei,
> > > >
> > > > The khugepaged has already checked the VM_NOHUGEPAGE flag for anonymous
> > > > memory during scan, as below:
> > > >
> > > > khugepaged_scan_mm_slot()
> > > >       thp_vma_allowable_order()
> > > >           thp_vma_allowable_orders()
> > > >               __thp_vma_allowable_orders()
> > > >                   vma_thp_disabled() {
> > > >                        if (vm_flags & VM_NOHUGEPAGE)
> > > >                            return true;
> > > >                   }
> > > >
> > > > REAL ISSUE: when madvise(MADV_COLD)，not set VM_NOHUGEPAGE flag to vma,
> > > > so the khugepaged will continue scan this vma.
> > > >
> > > > I set VM_NOHUGEPAGE flag to vma when madvise(MADV_COLD), the test has
> > > > been successful. I will send it in the next version.
> > >
> > > No we must not do that. That's a user-space visible change. :/
> >
> > David, what good ideas do you have to achieve this goal? let me know
> > please, thank!
>
> Your idea would be to skip a VMA when we issues madvise(MADV_COLD).
>
> That sounds like yet another heuristic that can easily be wrong? :/
>
> In particular, imagine if the VMA is much larger than the madvise'd region
> (other parts used for something else) or if the previously cold memory area
> is used for something that is now hot.
>
> With memory allocators that manage most of the memory in a single large VMA,
> it's rather easy to see how such a heuristic would be bad, no?

Thanks for the explanation. In my current approach, though, the large
VMA is split in this case:

madvise_vma_behavior
    madvise_cold
    madvise_update_vma

Maybe I'll send v2 first, so that we can discuss it more concretely :)

--
Merry Christmas,
Vernon

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Wei Yang 1 month, 2 weeks ago

On Sun, Dec 21, 2025 at 12:25:44PM +0800, Vernon Yang wrote:
>On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>> >On 12/19/25 06:29, Vernon Yang wrote:
>> >> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> >> > On 12/15/25 10:04, Vernon Yang wrote:
>> >> > > For example, create three task: hot1 -> cold -> hot2. After all three
>> >> > > task are created, each allocate memory 128MB. the hot1/hot2 task
>> >> > > continuously access 128 MB memory, while the cold task only accesses
>> >> > > its memory briefly andthen call madvise(MADV_COLD). However, khugepaged
>> >> > > still prioritizes scanning the cold task and only scans the hot2 task
>> >> > > after completing the scan of the cold task.
>> >> > >
>> >> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
>> >> > > memory is cold or will be freed, it is appropriate for khugepaged to
>> >> > > scan it only at the latest possible moment, thereby avoiding unnecessary
>> >> > > scan and collapse operations to reducing CPU wastage.
>> >> > >
>> >> > > Here are the performance test results:
>> >> > > (Throughput bigger is better, other smaller is better)
>> >> > >
>> >> > > Testing on x86_64 machine:
>> >> > >
>> >> > > | task hot2           | without patch | with patch    |  delta  |
>> >> > > |---------------------|---------------|---------------|---------|
>> >> > > | total accesses time |  3.14 sec     |  2.92 sec     | -7.01%  |
>> >> > > | cycles per access   |  4.91         |  2.07         | -57.84% |
>> >> > > | Throughput          |  104.38 M/sec |  112.12 M/sec | +7.42%  |
>> >> > > | dTLB-load-misses    |  288966432    |  1292908      | -99.55% |
>> >> > >
>> >> > > Testing on qemu-system-x86_64 -enable-kvm:
>> >> > >
>> >> > > | task hot2           | without patch | with patch    |  delta  |
>> >> > > |---------------------|---------------|---------------|---------|
>> >> > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
>> >> > > | cycles per access   |  7.23         |  2.12         | -70.68% |
>> >> > > | Throughput          |  97.88 M/sec  |  110.76 M/sec | +13.16% |
>> >> > > | dTLB-load-misses    |  237406497    |  3189194      | -98.66% |
>> >> >
>> >> > Again, I also don't like that because you make assumptions on a full process
>> >> > based on some part of it's address space.
>> >> >
>> >> > E.g., if a library issues a MADV_COLD on some part of the memory the library
>> >> > manages, why should the remaining part of the process suffer as well?
>> >>
>> >> Yes, you make a good point, thanks!
>> >>
>> >> > This seems to be an heuristic focused on some specific workloads, no?
>> >>
>> >> Right.
>> >>
>> >> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>> >> not be collapsed, so that khugepaged can simply skip this VMA during
>> >> scanning? This way, it won't affect the remaining part of the task's
>> >> memory regions.
>> >
>> >I thought we would skip these regions already properly in khugeapged, or
>> >maybe I misunderstood your question.
>> >
>>
>> I think we should, but seems we didn't do this for anonymous memory during
>> khugepaged.
>>
>> We check the vma with thp_vma_allowable_order() during scan.
>>
>>   * For anonymous memory during khugepaged, if we always enable 2M collapse,
>>     we will scan this vma. Even VM_NOHUGEPAGE is set.
>>
>>   * For other cases, it looks good since __thp_vma_allowable_order() will skip
>>     this vma with vma_thp_disabled().
>
>Hi David, Wei,
>
>The khugepaged has already checked the VM_NOHUGEPAGE flag for anonymous
>memory during scan, as below:
>
>khugepaged_scan_mm_slot()
>    thp_vma_allowable_order()
>        thp_vma_allowable_orders()

Oops, you are right. It only bypasses __thp_vma_allowable_orders() if orders
is 0.

>            __thp_vma_allowable_orders()
>                vma_thp_disabled() {
>                     if (vm_flags & VM_NOHUGEPAGE)
>                         return true;
>                }
>
>REAL ISSUE: when madvise(MADV_COLD)，not set VM_NOHUGEPAGE flag to vma,
>so the khugepaged will continue scan this vma.
>
>I set VM_NOHUGEPAGE flag to vma when madvise(MADV_COLD), the test has
>been successful. I will send it in the next version.
>
>--
>Thanks,
>Vernon

-- 
Wei Yang
Help you, Help me

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by kernel test robot 1 month, 3 weeks ago

Hi Vernon,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251216]
[cannot apply to linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512161405.8IVTXVcr-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512161405.8IVTXVcr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512161405.8IVTXVcr-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/madvise.c: In function 'madvise_cold':
>> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
     609 |         khugepaged_move_tail(vma->vm_mm);
         |         ^~~~~~~~~~~~~~~~~~~~
         |         khugepaged_exit


vim +609 mm/madvise.c

   595	
   596	static long madvise_cold(struct madvise_behavior *madv_behavior)
   597	{
   598		struct vm_area_struct *vma = madv_behavior->vma;
   599		struct mmu_gather tlb;
   600	
   601		if (!can_madv_lru_vma(vma))
   602			return -EINVAL;
   603	
   604		lru_add_drain();
   605		tlb_gather_mmu(&tlb, madv_behavior->mm);
   606		madvise_cold_page_range(&tlb, madv_behavior);
   607		tlb_finish_mmu(&tlb);
   608	
 > 609		khugepaged_move_tail(vma->vm_mm);
   610	
   611		return 0;
   612	}
   613	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by kernel test robot 1 month, 3 weeks ago

Hi Vernon,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512161406.RfF1dIYB-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512161406.RfF1dIYB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512161406.RfF1dIYB-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/madvise.c:609:2: error: call to undeclared function 'khugepaged_move_tail'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     609 |         khugepaged_move_tail(vma->vm_mm);
         |         ^
   mm/madvise.c:837:2: error: call to undeclared function 'khugepaged_move_tail'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     837 |         khugepaged_move_tail(mm);
         |         ^
   2 errors generated.


vim +/khugepaged_move_tail +609 mm/madvise.c

   595	
   596	static long madvise_cold(struct madvise_behavior *madv_behavior)
   597	{
   598		struct vm_area_struct *vma = madv_behavior->vma;
   599		struct mmu_gather tlb;
   600	
   601		if (!can_madv_lru_vma(vma))
   602			return -EINVAL;
   603	
   604		lru_add_drain();
   605		tlb_gather_mmu(&tlb, madv_behavior->mm);
   606		madvise_cold_page_range(&tlb, madv_behavior);
   607		tlb_finish_mmu(&tlb);
   608	
 > 609		khugepaged_move_tail(vma->vm_mm);
   610	
   611		return 0;
   612	}
   613	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by kernel test robot 1 month, 3 weeks ago

Hi Vernon,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160400.pTmarqg6-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/madvise.c: In function 'madvise_cold':
>> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
     609 |         khugepaged_move_tail(vma->vm_mm);
         |         ^~~~~~~~~~~~~~~~~~~~
         |         khugepaged_exit


vim +609 mm/madvise.c

   595	
   596	static long madvise_cold(struct madvise_behavior *madv_behavior)
   597	{
   598		struct vm_area_struct *vma = madv_behavior->vma;
   599		struct mmu_gather tlb;
   600	
   601		if (!can_madv_lru_vma(vma))
   602			return -EINVAL;
   603	
   604		lru_add_drain();
   605		tlb_gather_mmu(&tlb, madv_behavior->mm);
   606		madvise_cold_page_range(&tlb, madv_behavior);
   607		tlb_finish_mmu(&tlb);
   608	
 > 609		khugepaged_move_tail(vma->vm_mm);
   610	
   611		return 0;
   612	}
   613	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

Posted by Vernon Yang 1 month, 3 weeks ago

On Tue, Dec 16, 2025 at 05:12:16AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
> patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
> config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/config)
> compiler: arc-linux-gcc (GCC) 15.1.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160400.pTmarqg6-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>    mm/madvise.c: In function 'madvise_cold':
> >> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
>      609 |         khugepaged_move_tail(vma->vm_mm);
>          |         ^~~~~~~~~~~~~~~~~~~~
>          |         khugepaged_exit

This build error is triggered when CONFIG_TRANSPARENT_HUGEPAGE is disabled.
I'll fix it in the next version, thanks!
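
For reference, the usual way to avoid this class of error is to give the header a no-op stub for the disabled configuration. The following is only a self-contained user-space sketch of that pattern, not the actual fix. The struct mm_struct stand-in, its moved_to_tail field, and madvise_cold_demo() are made up for illustration, and the khugepaged_move_tail() signature is inferred from the call site in the report.

```c
/* Stand-in for the kernel's struct mm_struct; illustration only. */
struct mm_struct {
    int moved_to_tail;
};

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Stand-in for the real implementation that would live in mm/khugepaged.c. */
static void khugepaged_move_tail(struct mm_struct *mm)
{
    mm->moved_to_tail = 1;
}
#else
/* No-op stub so that callers still compile when THP is disabled. */
static inline void khugepaged_move_tail(struct mm_struct *mm)
{
    (void)mm;
}
#endif

/* Caller modelled on the madvise_cold() call site; needs no #ifdef. */
static long madvise_cold_demo(struct mm_struct *mm)
{
    khugepaged_move_tail(mm);
    return 0;
}
```

Built without -DCONFIG_TRANSPARENT_HUGEPAGE, the call compiles to nothing and moved_to_tail stays 0; built with it, the stand-in implementation runs instead.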

>
> vim +609 mm/madvise.c
>
>    595
>    596	static long madvise_cold(struct madvise_behavior *madv_behavior)
>    597	{
>    598		struct vm_area_struct *vma = madv_behavior->vma;
>    599		struct mmu_gather tlb;
>    600
>    601		if (!can_madv_lru_vma(vma))
>    602			return -EINVAL;
>    603
>    604		lru_add_drain();
>    605		tlb_gather_mmu(&tlb, madv_behavior->mm);
>    606		madvise_cold_page_range(&tlb, madv_behavior);
>    607		tlb_finish_mmu(&tlb);
>    608
>  > 609		khugepaged_move_tail(vma->vm_mm);
>    610
>    611		return 0;
>    612	}
>    613
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

--
Thanks,
Vernon

[PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
[PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
[PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
[PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY