[v1] mm/khugepaged: optimize collapse candidate detection

[PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lance Yang 2 weeks, 4 days ago

From: Lance Yang <lance.yang@linux.dev>

Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
mlocked VMAs should not be touched.

Note that the only other user of the VM_NO_KHUGEPAGED mask is
 __thp_vma_allowable_orders(), which is also used by the MADV_COLLAPSE
path. Since MADV_COLLAPSE has different rules (e.g., for mlocked VMAs), we
cannot simply make the shared mask stricter as that would break it.

So, we also introduce a new VM_NO_THP_COLLAPSE mask for that helper,
leaving the stricter checks to be applied only within the khugepaged path
itself.

Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 include/linux/mm.h |  6 +++++-
 mm/huge_memory.c   |  2 +-
 mm/khugepaged.c    | 14 +++++++++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index be3e6fb4d0db..cb54d94b2343 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -505,7 +505,11 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
 
 /* This mask prevents VMA from being scanned with khugepaged */
-#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
+#define VM_NO_KHUGEPAGED \
+	(VM_SPECIAL | VM_HUGETLB | VM_LOCKED_MASK | VM_NOHUGEPAGE)
+
+/* This mask prevents VMA from being collapsed by any THP path */
+#define VM_NO_THP_COLLAPSE	(VM_SPECIAL | VM_HUGETLB)
 
 /* This mask defines which mm->def_flags a process can inherit its parent */
 #define VM_INIT_DEF_MASK	VM_NOHUGEPAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6fc669e11c1..2e91526a037f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -134,7 +134,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	 * Must be checked after dax since some dax mappings may have
 	 * VM_MIXEDMAP set.
 	 */
-	if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
+	if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
 		return 0;
 
 	/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7c5ff1b23e93..e54f99bb0b57 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -345,6 +345,17 @@ struct attribute_group khugepaged_attr_group = {
 };
 #endif /* CONFIG_SYSFS */
 
+/**
+ * khugepaged_should_scan_vma - check if a VMA is a candidate for collapse
+ * @vm_flags: The flags of the VMA to check.
+ *
+ * Returns: true if the VMA should be scanned by khugepaged, false otherwise.
+ */
+static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
+{
+	return !(vm_flags & VM_NO_KHUGEPAGED);
+}
+
 int hugepage_madvise(struct vm_area_struct *vma,
 		     vm_flags_t *vm_flags, int advice)
 {
@@ -2443,7 +2454,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!khugepaged_should_scan_vma(vma->vm_flags) ||
+		    !thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
 skip:
 			progress++;
 			continue;
-- 
2.49.0

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Hugh Dickins 2 weeks, 2 days ago

On Sun, 14 Sep 2025, Lance Yang wrote:

> From: Lance Yang <lance.yang@linux.dev>
> 
> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
> mlocked VMAs should not be touched.

Why?  That's a change in behaviour, isn't it?

I'm aware that hugepage collapse on an mlocked VMA can insert a fault
latency, not universally welcome; but I've not seen discussion, let
alone agreement, that current behaviour should be changed.
Somewhere in yet-to-be-read mail?  Please give us a link.

Hugh

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lance Yang 2 weeks, 2 days ago

Hi Hugh,

Thanks for taking a look and for raising this important point!

On 2025/9/16 13:32, Hugh Dickins wrote:
> On Sun, 14 Sep 2025, Lance Yang wrote:
> 
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
>> mlocked VMAs should not be touched.
> 
> Why?  That's a change in behaviour, isn't it?
> 
> I'm aware that hugepage collapse on an mlocked VMA can insert a fault
> latency, not universally welcome; but I've not seen discussion, let
> alone agreement, that current behaviour should be changed.
> Somewhere in yet-to-be-read mail?  Please give us a link.
> 
> Hugh

You're right, this is indeed a change in behaviour. But it's specifically
for khugepaged.

Users of mlock() expect low and predictable latency. THP collapse is a
heavy operation that introduces exactly the kind of unpredictable delays
they want to avoid. It has to unmap PTEs, copy data from the small folios
to a new THP, and then remap the THP back to the PMD ;)

IMO, that change is acceptable because THP is generally transparent to
users, and khugepaged does not guarantee when THP collapse or split will
happen.

Well, we don't have a discussion on that, just something I noticed.

Thanks,
Lance

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Kiryl Shutsemau 2 weeks, 2 days ago

On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> Users of mlock() expect low and predictable latency. THP collapse is a
> heavy operation that introduces exactly the kind of unpredictable delays
> they want to avoid. It has to unmap PTEs, copy data from the small folios
> to a new THP, and then remap the THP back to the PMD ;)

Generally, we allow minor page faults into mlocked VMAs and avoid major.
This is minor page fault territory in my view.

Also it is very similar to what compaction does and we allow compaction
of mlocked VMA by default, unless sysctl vm.compact_unevictable_allowed
is set to zero.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lorenzo Stoakes 2 weeks, 2 days ago

On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> > Users of mlock() expect low and predictable latency. THP collapse is a
> > heavy operation that introduces exactly the kind of unpredictable delays
> > they want to avoid. It has to unmap PTEs, copy data from the small folios
> > to a new THP, and then remap the THP back to the PMD ;)
>
> Generally, we allow minor page faults into mlocked VMAs and avoid major.
> This is minor page fault territory in my view.

Hm, but we won't be causing minor faults via reclaim right, since they're
not on any LRU?

>
> Also it is very similar to what compaction does and we allow compaction
> of mlocked VMA by default, unless sysctl vm.compact_unevictable_allowed
> is set to zero.

This is a much stronger point.

I think we are sometimes too vague as to what mlock() means in
totality. But given that we default allow compaction it seems sensible to
keep this behaviour the same.

Unless you have a specific situation where this is problematic Lance?

>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Cheers, Lorenzo

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lance Yang 2 weeks, 2 days ago


On 2025/9/16 17:39, Lorenzo Stoakes wrote:
> On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
>> On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
>>> Users of mlock() expect low and predictable latency. THP collapse is a
>>> heavy operation that introduces exactly the kind of unpredictable delays
>>> they want to avoid. It has to unmap PTEs, copy data from the small folios
>>> to a new THP, and then remap the THP back to the PMD ;)
>>
>> Generally, we allow minor page faults into mlocked VMAs and avoid major.
>> This is minor page fault territory in my view.

Makes sense to me!

> 
> Hm, but we won't be causing minor faults via reclaim right, since they're
> not on any LRU?
> 
>>
>> Also it is very similar to what compaction does and we allow compaction
>> of mlocked VMA by default, unless sysctl vm.compact_unevictable_allowed
>> is set to zero.
> 
> This is a much stronger point.

Ah, indeed, the compaction analogy is quite strong here, thanks!

> 
> I think we are sometimes too vague as to what mlock() means in

Totally agree on too vague ;)

> totality. But given that we default allow compaction it seems sensible to
> keep this behaviour the same.
> 
> Unless you have a specific situation where this is problematic Lance?

Not a specific situation right now that would clearly make this problematic.

Anyway, I will drop this patch from the series.

Thanks again for all the feedback everyone!
Lance

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Kiryl Shutsemau 2 weeks, 2 days ago

On Tue, Sep 16, 2025 at 10:39:53AM +0100, Lorenzo Stoakes wrote:
> On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
> > On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> > > Users of mlock() expect low and predictable latency. THP collapse is a
> > > heavy operation that introduces exactly the kind of unpredictable delays
> > > they want to avoid. It has to unmap PTEs, copy data from the small folios
> > > to a new THP, and then remap the THP back to the PMD ;)
> >
> > Generally, we allow minor page faults into mlocked VMAs and avoid major.
> > This is minor page fault territory in my view.
> 
> Hm, but we won't be causing minor faults via reclaim right, since they're
> not on any LRU?

PTEs are still present when we do THP allocation. No reclaim while the
access is blocked. We only block the access on copy and PTEs->PMD
collapse.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lorenzo Stoakes 2 weeks, 2 days ago

On Tue, Sep 16, 2025 at 10:48:22AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 10:39:53AM +0100, Lorenzo Stoakes wrote:
> > On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
> > > On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> > > > Users of mlock() expect low and predictable latency. THP collapse is a
> > > > heavy operation that introduces exactly the kind of unpredictable delays
> > > > they want to avoid. It has to unmap PTEs, copy data from the small folios
> > > > to a new THP, and then remap the THP back to the PMD ;)
> > >
> > > Generally, we allow minor page faults into mlocked VMAs and avoid major.
> > > This is minor page fault territory in my view.
> >
> > Hm, but we won't be causing minor faults via reclaim right, since they're
> > not on any LRU?
>
> PTEs are still present when we do THP allocation. No reclaim while the
> access is blocked. We only block the access on copy and PTEs->PMD
> collapse.

Right indeed, esp. with compaction being allowed for mlock, I agree with you
that this patch should be dropped :)

>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Cheers, Lorenzo

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lance Yang 2 weeks, 2 days ago


On 2025/9/16 17:58, Lorenzo Stoakes wrote:
> On Tue, Sep 16, 2025 at 10:48:22AM +0100, Kiryl Shutsemau wrote:
>> On Tue, Sep 16, 2025 at 10:39:53AM +0100, Lorenzo Stoakes wrote:
>>> On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
>>>> On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
>>>>> Users of mlock() expect low and predictable latency. THP collapse is a
>>>>> heavy operation that introduces exactly the kind of unpredictable delays
>>>>> they want to avoid. It has to unmap PTEs, copy data from the small folios
>>>>> to a new THP, and then remap the THP back to the PMD ;)
>>>>
>>>> Generally, we allow minor page faults into mlocked VMAs and avoid major.
>>>> This is minor page fault territory in my view.
>>>
>>> Hm, but we won't be causing minor faults via reclaim right, since they're
>>> not on any LRU?
>>
>> PTEs are still present when we do THP allocation. No reclaim while the
>> access is blocked. We only block the access on copy and PTEs->PMD
>> collapse.
> 
> Right indeed, esp. with compaction being allowed for mlock, I agree with you
> that this patch should be dropped :)

Got it. Will do ;)

Thanks,
Lance

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Hugh Dickins 2 weeks, 2 days ago

On Tue, 16 Sep 2025, Lance Yang wrote:

> Hi Hugh,
> 
> Thanks for taking a look and for raising this important point!
> 
> On 2025/9/16 13:32, Hugh Dickins wrote:
> > On Sun, 14 Sep 2025, Lance Yang wrote:
> > 
> >> From: Lance Yang <lance.yang@linux.dev>
> >>
> >> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
> >> mlocked VMAs should not be touched.
> > 
> > Why?  That's a change in behaviour, isn't it?
> > 
> > I'm aware that hugepage collapse on an mlocked VMA can insert a fault
> > latency, not universally welcome; but I've not seen discussion, let
> > alone agreement, that current behaviour should be changed.
> > Somewhere in yet-to-be-read mail?  Please give us a link.
> > 
> > Hugh
> 
> You're right, this is indeed a change in behaviour. But it's specifically
> for khugepaged.
> 
> Users of mlock() expect low and predictable latency. THP collapse is a
> heavy operation that introduces exactly the kind of unpredictable delays
> they want to avoid. It has to unmap PTEs, copy data from the small folios
> to a new THP, and then remap the THP back to the PMD ;)
> 
> IMO, that change is acceptable because THP is generally transparent to
> users, and khugepaged does not guarantee when THP collapse or split will
> happen.

I disagree.  Many of those who have khugepaged enabled would prefer
it to give them hugepages, even or especially on mlocked areas.

If you make that change, it must be guarded by a sysfs or sysctl tuning.

Perhaps it could share the sysctl_compact_unevictable_allowed tuning
(I'm not sure whether that's a good or bad idea: opinions will differ).

Hugh

> 
> Well, we don't have a discussion on that, just something I noticed.
> 
> Thanks,
> Lance

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lance Yang 2 weeks, 2 days ago


On 2025/9/16 14:42, Hugh Dickins wrote:
> On Tue, 16 Sep 2025, Lance Yang wrote:
> 
>> Hi Hugh,
>>
>> Thanks for taking a look and for raising this important point!
>>
>> On 2025/9/16 13:32, Hugh Dickins wrote:
>>> On Sun, 14 Sep 2025, Lance Yang wrote:
>>>
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
>>>> mlocked VMAs should not be touched.
>>>
>>> Why?  That's a change in behaviour, isn't it?
>>>
>>> I'm aware that hugepage collapse on an mlocked VMA can insert a fault
>>> latency, not universally welcome; but I've not seen discussion, let
>>> alone agreement, that current behaviour should be changed.
>>> Somewhere in yet-to-be-read mail?  Please give us a link.
>>>
>>> Hugh
>>
>> You're right, this is indeed a change in behaviour. But it's specifically
>> for khugepaged.
>>
>> Users of mlock() expect low and predictable latency. THP collapse is a
>> heavy operation that introduces exactly the kind of unpredictable delays
>> they want to avoid. It has to unmap PTEs, copy data from the small folios
>> to a new THP, and then remap the THP back to the PMD ;)
>>
>> IMO, that change is acceptable because THP is generally transparent to
>> users, and khugepaged does not guarantee when THP collapse or split will
>> happen.
> 
> I disagree.  Many of those who have khugepaged enabled would prefer
> it to give them hugepages, even or especially on mlocked areas.
> 
> If you make that change, it must be guarded by a sysfs or sysctl tuning.

Thanks for the feedback!

Well, seems like we're not on the same page. Let's gather more opinions from
other folks ;)

> 
> Perhaps it could share the sysctl_compact_unevictable_allowed tuning
> (I'm not sure whether that's a good or bad idea: opinions will differ).

Thanks,
Lance

> 
> Hugh
> 
>>
>> Well, we don't have a discussion on that, just something I noticed.
>>
>> Thanks,
>> Lance

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Dev Jain 2 weeks, 4 days ago

On 14/09/25 8:05 pm, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
> mlocked VMAs should not be touched.
>
> Note that the only other user of the VM_NO_KHUGEPAGED mask is
>   __thp_vma_allowable_orders(), which is also used by the MADV_COLLAPSE
> path. Since MADV_COLLAPSE has different rules (e.g., for mlocked VMAs), we
> cannot simply make the shared mask stricter as that would break it.
>
> So, we also introduce a new VM_NO_THP_COLLAPSE mask for that helper,
> leaving the stricter checks to be applied only within the khugepaged path
> itself.
>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   include/linux/mm.h |  6 +++++-
>   mm/huge_memory.c   |  2 +-
>   mm/khugepaged.c    | 14 +++++++++++++-
>   3 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index be3e6fb4d0db..cb54d94b2343 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -505,7 +505,11 @@ extern unsigned int kobjsize(const void *objp);
>   #define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
>   
>   /* This mask prevents VMA from being scanned with khugepaged */
> -#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
> +#define VM_NO_KHUGEPAGED \
> +	(VM_SPECIAL | VM_HUGETLB | VM_LOCKED_MASK | VM_NOHUGEPAGE)
> +
> +/* This mask prevents VMA from being collapsed by any THP path */
> +#define VM_NO_THP_COLLAPSE	(VM_SPECIAL | VM_HUGETLB)

VM_NO_KHUGEPAGED should then be defined as VM_NO_THP_COLLAPSE | VM_LOCKED_MASK | VM_NOHUGEPAGE.
But...

I believe that the eligibility checking for khugepaged collapse is the business of
thp_vma_allowable_order(). This functionality should be put there, we literally
have a TVA_KHUGEPAGED flag :)

>   
>   /* This mask defines which mm->def_flags a process can inherit its parent */
>   #define VM_INIT_DEF_MASK	VM_NOHUGEPAGE
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d6fc669e11c1..2e91526a037f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -134,7 +134,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>   	 * Must be checked after dax since some dax mappings may have
>   	 * VM_MIXEDMAP set.
>   	 */
> -	if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
> +	if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
>   		return 0;
>   
>   	/*
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7c5ff1b23e93..e54f99bb0b57 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -345,6 +345,17 @@ struct attribute_group khugepaged_attr_group = {
>   };
>   #endif /* CONFIG_SYSFS */
>   
> +/**
> + * khugepaged_should_scan_vma - check if a VMA is a candidate for collapse
> + * @vm_flags: The flags of the VMA to check.
> + *
> + * Returns: true if the VMA should be scanned by khugepaged, false otherwise.
> + */
> +static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
> +{
> +	return !(vm_flags & VM_NO_KHUGEPAGED);
> +}
> +
>   int hugepage_madvise(struct vm_area_struct *vma,
>   		     vm_flags_t *vm_flags, int advice)
>   {
> @@ -2443,7 +2454,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>   			progress++;
>   			break;
>   		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!khugepaged_should_scan_vma(vma->vm_flags) ||
> +		    !thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
>   skip:
>   			progress++;
>   			continue;

Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()

Posted by Lance Yang 2 weeks, 3 days ago

Hey Dev,

Thanks for taking time to review!

On 2025/9/15 00:16, Dev Jain wrote:
> 
> On 14/09/25 8:05 pm, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
>> mlocked VMAs should not be touched.
>>
>> Note that the only other user of the VM_NO_KHUGEPAGED mask is
>>   __thp_vma_allowable_orders(), which is also used by the MADV_COLLAPSE
>> path. Since MADV_COLLAPSE has different rules (e.g., for mlocked 
>> VMAs), we
>> cannot simply make the shared mask stricter as that would break it.
>>
>> So, we also introduce a new VM_NO_THP_COLLAPSE mask for that helper,
>> leaving the stricter checks to be applied only within the khugepaged path
>> itself.
>>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   include/linux/mm.h |  6 +++++-
>>   mm/huge_memory.c   |  2 +-
>>   mm/khugepaged.c    | 14 +++++++++++++-
>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index be3e6fb4d0db..cb54d94b2343 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -505,7 +505,11 @@ extern unsigned int kobjsize(const void *objp);
>>   #define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | 
>> VM_DONTDUMP)
>>   /* This mask prevents VMA from being scanned with khugepaged */
>> -#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
>> +#define VM_NO_KHUGEPAGED \
>> +    (VM_SPECIAL | VM_HUGETLB | VM_LOCKED_MASK | VM_NOHUGEPAGE)
>> +
>> +/* This mask prevents VMA from being collapsed by any THP path */
>> +#define VM_NO_THP_COLLAPSE    (VM_SPECIAL | VM_HUGETLB)
> 
> VM_NO_KHUGEPAGED should then be defined as VM_NO_THP_COLLAPSE | 
> VM_LOCKED_MASK | VM_NOHUGEPAGE.

Yep, it's a good cleanup ;)

> But...
> 
> I believe that the eligibility checking for khugepaged collapse is the 
> business of
> thp_vma_allowable_order(). This functionality should be put there, we 
> literally
> have a TVA_KHUGEPAGED flag :)

Good spot. That's a much better apporach!

My initial thinking was to keep thp_vma_allowable_order() as generic as
possible, avoiding specific checks for individual callers ;)

BUT you are right, the TVA_KHUGEPAGED flag is only passed from the
khugepaged path, so the compiler will optimize out the branch for other
callers, leaving no runtime overhead.

Will rework this patch for v2 as your suggestion!

Thanks,
Lance

> 
>>   /* This mask defines which mm->def_flags a process can inherit its 
>> parent */
>>   #define VM_INIT_DEF_MASK    VM_NOHUGEPAGE
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d6fc669e11c1..2e91526a037f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -134,7 +134,7 @@ unsigned long __thp_vma_allowable_orders(struct 
>> vm_area_struct *vma,
>>        * Must be checked after dax since some dax mappings may have
>>        * VM_MIXEDMAP set.
>>        */
>> -    if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>> +    if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
>>           return 0;
>>       /*
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 7c5ff1b23e93..e54f99bb0b57 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -345,6 +345,17 @@ struct attribute_group khugepaged_attr_group = {
>>   };
>>   #endif /* CONFIG_SYSFS */
>> +/**
>> + * khugepaged_should_scan_vma - check if a VMA is a candidate for 
>> collapse
>> + * @vm_flags: The flags of the VMA to check.
>> + *
>> + * Returns: true if the VMA should be scanned by khugepaged, false 
>> otherwise.
>> + */
>> +static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
>> +{
>> +    return !(vm_flags & VM_NO_KHUGEPAGED);
>> +}
>> +
>>   int hugepage_madvise(struct vm_area_struct *vma,
>>                vm_flags_t *vm_flags, int advice)
>>   {
>> @@ -2443,7 +2454,8 @@ static unsigned int 
>> khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>               progress++;
>>               break;
>>           }
>> -        if (!thp_vma_allowable_order(vma, vma->vm_flags, 
>> TVA_KHUGEPAGED, PMD_ORDER)) {
>> +        if (!khugepaged_should_scan_vma(vma->vm_flags) ||
>> +            !thp_vma_allowable_order(vma, vma->vm_flags, 
>> TVA_KHUGEPAGED, PMD_ORDER)) {
>>   skip:
>>               progress++;
>>               continue;