[PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled

Baolin Wang posted 2 patches 3 months, 2 weeks ago
There is a newer version of this series
Posted by Baolin Wang 3 months, 2 weeks ago
When thp_vma_allowable_orders() is invoked without the TVA_ENFORCE_SYSFS flag,
the THP sysfs settings are ignored. Whilst this makes sense for callers that
do not specify the flag, it creates an odd and surprising situation in which a
sysadmin who specifies 'never' for all THP sizes still observes THP pages
being allocated and used on the system.

The motivating case for this is MADV_COLLAPSE, which ignores the system-wide
anon THP sysfs settings: even when anon THP has been disabled, MADV_COLLAPSE
will still attempt to collapse into an anon THP. This violates the rule we
have agreed upon: never means never.

Currently, besides MADV_COLLAPSE, there is only one other caller that does not
set TVA_ENFORCE_SYSFS: collapse_pte_mapped_thp(). I believe that is reasonable,
judging by its comment:

"
/*
 * If we are here, we've succeeded in replacing all the native pages
 * in the page cache with a single hugepage. If a mm were to fault-in
 * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 * analogously elide sysfs THP settings here.
 */
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
"

Another rule for madvise, following David's suggestion: "allowing for
collapsing in a VM without VM_HUGEPAGE in the 'madvise' mode would be fine".

To address this issue, the current strategy should be:

If no hugepage modes are enabled for the desired orders, nor can we enable them
by inheriting from a 'global' enabled setting - then it must be the case that
all desired orders either specify or inherit 'NEVER' - and we must abort.

Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
THP.

Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
 tools/testing/selftests/mm/khugepaged.c |  6 +--
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d5bb67dc4ec..ab70ca4e704b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -267,6 +267,42 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long tva_flags,
 					 unsigned long orders);
 
+/* Strictly mask requested anonymous orders according to sysfs settings. */
+static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
+		unsigned long tva_flags, unsigned long orders)
+{
+	const unsigned long always = READ_ONCE(huge_anon_orders_always);
+	const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
+	const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
+	const unsigned long never = ~(always | madvise | inherit);
+	const bool inherit_never = !hugepage_global_enabled();
+
+	/* Disallow orders that are set to NEVER directly ... */
+	orders &= ~never;
+
+	/* ... or through inheritance (global == NEVER). */
+	if (inherit_never)
+		orders &= ~inherit;
+
+	/*
+	 * Otherwise, we only enforce sysfs settings if asked. In addition,
+	 * if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
+	 * is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
+	 * set.
+	 */
+	if (!(tva_flags & TVA_ENFORCE_SYSFS))
+		return orders;
+
+	/* We already excluded never inherit above. */
+	if (vm_flags & VM_HUGEPAGE)
+		return orders & (always | madvise | inherit);
+
+	if (hugepage_global_always())
+		return orders & (always | inherit);
+
+	return orders & always;
+}
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
@@ -289,19 +325,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       unsigned long orders)
 {
 	/* Optimization to check if required orders are enabled early. */
-	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
-		unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
-		if (vm_flags & VM_HUGEPAGE)
-			mask |= READ_ONCE(huge_anon_orders_madvise);
-		if (hugepage_global_always() ||
-		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
-			mask |= READ_ONCE(huge_anon_orders_inherit);
-
-		orders &= mask;
-		if (!orders)
-			return 0;
-	}
+	if (vma_is_anonymous(vma))
+		orders = __thp_mask_anon_orders(vm_flags, tva_flags, orders);
 
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 4341ce6b3b38..85bfff53dba6 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 
 	printf("%s...", msg);
 
-	/*
-	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
-	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
-	 */
-	settings.thp_enabled = THP_NEVER;
+	settings.thp_enabled = THP_ALWAYS;
 	settings.shmem_enabled = SHMEM_NEVER;
 	thp_push_settings(&settings);
 
-- 
2.43.5
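
The masking rules in __thp_mask_anon_orders() above can be traced by hand with
a small userspace model. This is only a sketch under stated assumptions, not
kernel code: the sysfs bitmaps and the global mode are passed in through a
hypothetical struct thp_sysfs instead of being read with READ_ONCE(), and the
bit values used for TVA_ENFORCE_SYSFS and VM_HUGEPAGE are arbitrary stand-ins.

```c
/* Stand-in flag values for this model; the kernel's real values differ. */
#define TVA_ENFORCE_SYSFS (1UL << 0)
#define VM_HUGEPAGE       (1UL << 1)

/* Models the top-level "enabled" setting. */
enum global_mode { GLOBAL_NEVER, GLOBAL_MADVISE, GLOBAL_ALWAYS };

/* Hypothetical snapshot of the anon THP sysfs state (one bit per order). */
struct thp_sysfs {
	unsigned long always;    /* orders set to "always" */
	unsigned long madvise;   /* orders set to "madvise" */
	unsigned long inherit;   /* orders set to "inherit" */
	enum global_mode global; /* what "inherit" resolves to */
};

/* Follows the control flow of __thp_mask_anon_orders() in the patch. */
static unsigned long mask_anon_orders(const struct thp_sysfs *s,
				      unsigned long vm_flags,
				      unsigned long tva_flags,
				      unsigned long orders)
{
	const unsigned long never = ~(s->always | s->madvise | s->inherit);

	/* Orders set to never directly are always dropped ... */
	orders &= ~never;

	/* ... as are orders that inherit a global never. */
	if (s->global == GLOBAL_NEVER)
		orders &= ~s->inherit;

	/* Callers that do not enforce sysfs stop here. */
	if (!(tva_flags & TVA_ENFORCE_SYSFS))
		return orders;

	if (vm_flags & VM_HUGEPAGE)
		return orders & (s->always | s->madvise | s->inherit);

	if (s->global == GLOBAL_ALWAYS)
		return orders & (s->always | s->inherit);

	return orders & s->always;
}
```

With every size set to never and TVA_ENFORCE_SYSFS clear (the MADV_COLLAPSE
case), mask_anon_orders() returns 0 for any requested order, which is the
"never means never" behaviour the patch is after. With an order set to madvise
and the flag clear, the order survives even without VM_HUGEPAGE.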

Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Dev Jain 3 months, 2 weeks ago
On 23/06/25 1:58 pm, Baolin Wang wrote:
> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system.
>
> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
> the system-wide Anon THP sysfs settings, which means that even though we have
> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
> into a Anon THP. This violates the rule we have agreed upon: never means never.
>
> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
> comments:
>
> "
> /*
>   * If we are here, we've succeeded in replacing all the native pages
>   * in the page cache with a single hugepage. If a mm were to fault-in
>   * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>   * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>   * analogously elide sysfs THP settings here.
>   */
> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))

So the behaviour now is: first check whether the THP settings converge to
never; then, if TVA_ENFORCE_SYSFS is not set, return immediately. Given that,
would it be better for this khugepaged code to call
__thp_vma_allowable_orders() directly? If the sysfs settings are changed to
never before we reach collapse_pte_mapped_thp(), we will now return
SCAN_VMA_CHECK from here, whereas the comment says "regardless of sysfs THP
settings", which should include "regardless of whether the sysfs settings say
never".

> "
>
> Another rule for madvise, referring to David's suggestion: “allowing for
> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>
> To address this issue, the current strategy should be:
>
> If no hugepage modes are enabled for the desired orders, nor can we enable them
> by inheriting from a 'global' enabled setting - then it must be the case that
> all desired orders either specify or inherit 'NEVER' - and we must abort.
>
> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
> THP.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>   tools/testing/selftests/mm/khugepaged.c |  6 +--
>   2 files changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4d5bb67dc4ec..ab70ca4e704b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -267,6 +267,42 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>   					 unsigned long tva_flags,
>   					 unsigned long orders);
>   
> +/* Strictly mask requested anonymous orders according to sysfs settings. */
> +static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
> +		unsigned long tva_flags, unsigned long orders)
> +{
> +	const unsigned long always = READ_ONCE(huge_anon_orders_always);
> +	const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
> +	const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
> +	const unsigned long never = ~(always | madvise | inherit);
> +	const bool inherit_never = !hugepage_global_enabled();
> +
> +	/* Disallow orders that are set to NEVER directly ... */
> +	orders &= ~never;
> +
> +	/* ... or through inheritance (global == NEVER). */
> +	if (inherit_never)
> +		orders &= ~inherit;
> +
> +	/*
> +	 * Otherwise, we only enforce sysfs settings if asked. In addition,
> +	 * if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
> +	 * is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
> +	 * set.
> +	 */
> +	if (!(tva_flags & TVA_ENFORCE_SYSFS))
> +		return orders;
> +
> +	/* We already excluded never inherit above. */
> +	if (vm_flags & VM_HUGEPAGE)
> +		return orders & (always | madvise | inherit);
> +
> +	if (hugepage_global_always())
> +		return orders & (always | inherit);
> +
> +	return orders & always;
> +}
> +
>   /**
>    * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>    * @vma:  the vm area to check
> @@ -289,19 +325,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>   				       unsigned long orders)
>   {
>   	/* Optimization to check if required orders are enabled early. */
> -	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> -		unsigned long mask = READ_ONCE(huge_anon_orders_always);
> -
> -		if (vm_flags & VM_HUGEPAGE)
> -			mask |= READ_ONCE(huge_anon_orders_madvise);
> -		if (hugepage_global_always() ||
> -		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> -			mask |= READ_ONCE(huge_anon_orders_inherit);
> -
> -		orders &= mask;
> -		if (!orders)
> -			return 0;
> -	}
> +	if (vma_is_anonymous(vma))
> +		orders = __thp_mask_anon_orders(vm_flags, tva_flags, orders);
>   
>   	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>   }
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 4341ce6b3b38..85bfff53dba6 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>   
>   	printf("%s...", msg);
>   
> -	/*
> -	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
> -	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
> -	 */
> -	settings.thp_enabled = THP_NEVER;
> +	settings.thp_enabled = THP_ALWAYS;
>   	settings.shmem_enabled = SHMEM_NEVER;
>   	thp_push_settings(&settings);
>   
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/6/24 16:41, Dev Jain wrote:
> 
> On 23/06/25 1:58 pm, Baolin Wang wrote:
>> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag 
>> is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes 
>> sense for the
>> callers who do not specify this flag, it creates a odd and surprising 
>> situation
>> where a sysadmin specifying 'never' for all THP sizes still observing 
>> THP pages
>> being allocated and used on the system.
>>
>> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will 
>> ignore
>> the system-wide Anon THP sysfs settings, which means that even though 
>> we have
>> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt 
>> to collapse
>> into a Anon THP. This violates the rule we have agreed upon: never 
>> means never.
>>
>> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there 
>> is only
>> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
>> collapse_pte_mapped_thp() function, but I believe this is reasonable 
>> from its
>> comments:
>>
>> "
>> /*
>>   * If we are here, we've succeeded in replacing all the native pages
>>   * in the page cache with a single hugepage. If a mm were to fault-in
>>   * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>>   * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>>   * analogously elide sysfs THP settings here.
>>   */
>> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> 
> So the behaviour now is: First check whether THP settings converge to 
> never.
> Then, if enforce_sysfs is not set, return immediately. So in this 
> khugepaged
> code will it be better to call __thp_vma_allowable_orders()? If the sysfs
> settings are changed to never before hitting collapse_pte_mapped_thp(),
> then right now we will return SCAN_VMA_CHECK from here, whereas, the 
> comment
> says "regardless of sysfs THP settings", which should include "regardless
> of whether the sysfs settings say never".

Sounds reasonable to me. Thanks.

I will change thp_vma_allowable_order() to __thp_vma_allowable_orders() 
in the collapse_pte_mapped_thp() function to maintain consistency with 
the original logic.

Lorenzo and David, what do you think? Thanks.
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/6/24 17:57, Baolin Wang wrote:
> 
> 
> On 2025/6/24 16:41, Dev Jain wrote:
>>
>> On 23/06/25 1:58 pm, Baolin Wang wrote:
>>> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag 
>>> is not
>>> specified, we will ignore the THP sysfs settings. Whilst it makes 
>>> sense for the
>>> callers who do not specify this flag, it creates a odd and surprising 
>>> situation
>>> where a sysadmin specifying 'never' for all THP sizes still observing 
>>> THP pages
>>> being allocated and used on the system.
>>>
>>> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will 
>>> ignore
>>> the system-wide Anon THP sysfs settings, which means that even though 
>>> we have
>>> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt 
>>> to collapse
>>> into a Anon THP. This violates the rule we have agreed upon: never 
>>> means never.
>>>
>>> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there 
>>> is only
>>> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
>>> collapse_pte_mapped_thp() function, but I believe this is reasonable 
>>> from its
>>> comments:
>>>
>>> "
>>> /*
>>>   * If we are here, we've succeeded in replacing all the native pages
>>>   * in the page cache with a single hugepage. If a mm were to fault-in
>>>   * this memory (mapped by a suitably aligned VMA), we'd get the 
>>> hugepage
>>>   * and map it by a PMD, regardless of sysfs THP settings. As such, 
>>> let's
>>>   * analogously elide sysfs THP settings here.
>>>   */
>>> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
>>
>> So the behaviour now is: First check whether THP settings converge to 
>> never.
>> Then, if enforce_sysfs is not set, return immediately. So in this 
>> khugepaged
>> code will it be better to call __thp_vma_allowable_orders()? If the sysfs
>> settings are changed to never before hitting collapse_pte_mapped_thp(),
>> then right now we will return SCAN_VMA_CHECK from here, whereas, the 
>> comment
>> says "regardless of sysfs THP settings", which should include "regardless
>> of whether the sysfs settings say never".
> 
> Sounds reasonable to me. Thanks.
> 
> I will change thp_vma_allowable_order() to __thp_vma_allowable_orders() 
> in the collapse_pte_mapped_thp() function to maintain consistency with 
> the original logic.
> 
> Lorenzo and David, how do you think? Thanks.

After thinking about it more: collapse_pte_mapped_thp() is only used for 
file/shmem collapse, and the new sysfs masking is applied only to 
anonymous VMAs, so changing to __thp_vma_allowable_orders() would have 
no effect. I prefer to leave it as is.
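
The reasoning above follows from the shape of thp_vma_allowable_orders() after
the patch, which can be modelled in a few lines. This is a sketch only: struct
vma_model and both helper stubs are hypothetical, and the inner check is
assumed to pass the orders through unchanged so that only the wrapper's gating
is visible.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a VMA: records only whether it is anonymous. */
struct vma_model {
	bool anonymous;
};

/* Stub for __thp_vma_allowable_orders(); assumed to pass orders through. */
static unsigned long inner_allowable_orders(unsigned long orders)
{
	return orders;
}

/* Stub for __thp_mask_anon_orders() with every anon size set to never. */
static unsigned long mask_anon_never(unsigned long orders)
{
	(void)orders;
	return 0;
}

/* Shape of thp_vma_allowable_orders() after the patch. */
static unsigned long allowable_orders(const struct vma_model *vma,
				      unsigned long orders)
{
	/* The sysfs masking is applied to anonymous VMAs only. */
	if (vma->anonymous)
		orders = mask_anon_never(orders);

	return inner_allowable_orders(orders);
}
```

For a file or shmem VMA the wrapper and the inner function give the same
result in this model, so switching the call site would change nothing.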
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Dev Jain 3 months, 2 weeks ago
On 24/06/25 7:38 pm, Baolin Wang wrote:
>
>
> On 2025/6/24 17:57, Baolin Wang wrote:
>>
>>
>> On 2025/6/24 16:41, Dev Jain wrote:
>>>
>>> On 23/06/25 1:58 pm, Baolin Wang wrote:
>>>> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS 
>>>> flag is not
>>>> specified, we will ignore the THP sysfs settings. Whilst it makes 
>>>> sense for the
>>>> callers who do not specify this flag, it creates a odd and 
>>>> surprising situation
>>>> where a sysadmin specifying 'never' for all THP sizes still 
>>>> observing THP pages
>>>> being allocated and used on the system.
>>>>
>>>> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE 
>>>> will ignore
>>>> the system-wide Anon THP sysfs settings, which means that even 
>>>> though we have
>>>> disabled the Anon THP configuration, MADV_COLLAPSE will still 
>>>> attempt to collapse
>>>> into a Anon THP. This violates the rule we have agreed upon: never 
>>>> means never.
>>>>
>>>> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, 
>>>> there is only
>>>> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
>>>> collapse_pte_mapped_thp() function, but I believe this is 
>>>> reasonable from its
>>>> comments:
>>>>
>>>> "
>>>> /*
>>>>   * If we are here, we've succeeded in replacing all the native pages
>>>>   * in the page cache with a single hugepage. If a mm were to fault-in
>>>>   * this memory (mapped by a suitably aligned VMA), we'd get the 
>>>> hugepage
>>>>   * and map it by a PMD, regardless of sysfs THP settings. As such, 
>>>> let's
>>>>   * analogously elide sysfs THP settings here.
>>>>   */
>>>> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
>>>
>>> So the behaviour now is: First check whether THP settings converge 
>>> to never.
>>> Then, if enforce_sysfs is not set, return immediately. So in this 
>>> khugepaged
>>> code will it be better to call __thp_vma_allowable_orders()? If the 
>>> sysfs
>>> settings are changed to never before hitting collapse_pte_mapped_thp(),
>>> then right now we will return SCAN_VMA_CHECK from here, whereas, the 
>>> comment
>>> says "regardless of sysfs THP settings", which should include 
>>> "regardless
>>> of whether the sysfs settings say never".
>>
>> Sounds reasonable to me. Thanks.
>>
>> I will change thp_vma_allowable_order() to 
>> __thp_vma_allowable_orders() in the collapse_pte_mapped_thp() 
>> function to maintain consistency with the original logic.
>>
>> Lorenzo and David, how do you think? Thanks.
>
> After thinking more, since collapse_pte_mapped_thp() is only used for 
> file/shmem collapse, changing to __thp_vma_allowable_orders() has no 
> effect. So I prefer to leave it as is.


Oops my bad, thanks.

Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Zi Yan 3 months, 2 weeks ago
On 23 Jun 2025, at 4:28, Baolin Wang wrote:

> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system.
>
> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
> the system-wide Anon THP sysfs settings, which means that even though we have
> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
> into a Anon THP. This violates the rule we have agreed upon: never means never.
>
> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
> comments:
>
> "
> /*
>  * If we are here, we've succeeded in replacing all the native pages
>  * in the page cache with a single hugepage. If a mm were to fault-in
>  * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>  * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>  * analogously elide sysfs THP settings here.
>  */
> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> "
>
> Another rule for madvise, referring to David's suggestion: “allowing for
> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>
> To address this issue, the current strategy should be:
>
> If no hugepage modes are enabled for the desired orders, nor can we enable them
> by inheriting from a 'global' enabled setting - then it must be the case that
> all desired orders either specify or inherit 'NEVER' - and we must abort.
>
> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
> THP.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>  tools/testing/selftests/mm/khugepaged.c |  6 +--
>  2 files changed, 39 insertions(+), 18 deletions(-)
>
The code looks much cleaner. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by David Hildenbrand 3 months, 2 weeks ago
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 4341ce6b3b38..85bfff53dba6 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>   
>   	printf("%s...", msg);
>   
> -	/*
> -	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
> -	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
> -	 */
> -	settings.thp_enabled = THP_NEVER;
> +	settings.thp_enabled = THP_ALWAYS;


Would MADVISE mode also work here? If we don't set MADV_HUGEPAGE, then 
khugepaged should be excluded, correct?


-- 
Cheers,

David / dhildenb
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/6/23 21:54, David Hildenbrand wrote:
> 
>> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/ 
>> selftests/mm/khugepaged.c
>> index 4341ce6b3b38..85bfff53dba6 100644
>> --- a/tools/testing/selftests/mm/khugepaged.c
>> +++ b/tools/testing/selftests/mm/khugepaged.c
>> @@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, 
>> char *p, int nr_hpages,
>>       printf("%s...", msg);
>> -    /*
>> -     * Prevent khugepaged interference and tests that MADV_COLLAPSE
>> -     * ignores /sys/kernel/mm/transparent_hugepage/enabled
>> -     */
>> -    settings.thp_enabled = THP_NEVER;
>> +    settings.thp_enabled = THP_ALWAYS;
> 
> 
> Would MADVISE mode also work here? If we don't set MADV_HUGEPAGE, then 
> khugepaged should be excluded, correct?

I tried this, but some test cases failed. As I replied to Barry, it's 
because some tests previously set MADV_NOHUGEPAGE, and now there is no 
way to clear the MADV_NOHUGEPAGE flag except by setting MADV_HUGEPAGE.
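
The flag behaviour described above can be sketched as a small userspace model.
It assumes, as I understand the madvise handling, that MADV_HUGEPAGE clears
VM_NOHUGEPAGE and sets VM_HUGEPAGE, and that MADV_NOHUGEPAGE does the reverse.
The bit values, enum thp_advice and apply_advice() are hypothetical names for
illustration, not kernel definitions.

```c
/* Stand-in bit values for this model; the kernel's real values differ. */
#define VM_HUGEPAGE   (1UL << 0)
#define VM_NOHUGEPAGE (1UL << 1)

/* Hypothetical stand-ins for MADV_HUGEPAGE and MADV_NOHUGEPAGE. */
enum thp_advice { ADVISE_HUGEPAGE, ADVISE_NOHUGEPAGE };

/* Returns the VMA flags after applying one piece of advice. */
static unsigned long apply_advice(unsigned long vm_flags,
				  enum thp_advice advice)
{
	switch (advice) {
	case ADVISE_HUGEPAGE:
		/* Clearing VM_NOHUGEPAGE also sets VM_HUGEPAGE. */
		return (vm_flags & ~VM_NOHUGEPAGE) | VM_HUGEPAGE;
	case ADVISE_NOHUGEPAGE:
		/* Clearing VM_HUGEPAGE also sets VM_NOHUGEPAGE. */
		return (vm_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
	}
	return vm_flags;
}
```

Once either advice has been applied, no sequence of the two returns the flags
to the neutral state with both bits clear. A test that clears VM_NOHUGEPAGE
therefore ends up with VM_HUGEPAGE set, which makes the range a khugepaged
candidate in madvise mode.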
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by David Hildenbrand 3 months, 2 weeks ago
On 24.06.25 03:48, Baolin Wang wrote:
> 
> 
> On 2025/6/23 21:54, David Hildenbrand wrote:
>>
>>> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/
>>> selftests/mm/khugepaged.c
>>> index 4341ce6b3b38..85bfff53dba6 100644
>>> --- a/tools/testing/selftests/mm/khugepaged.c
>>> +++ b/tools/testing/selftests/mm/khugepaged.c
>>> @@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg,
>>> char *p, int nr_hpages,
>>>        printf("%s...", msg);
>>> -    /*
>>> -     * Prevent khugepaged interference and tests that MADV_COLLAPSE
>>> -     * ignores /sys/kernel/mm/transparent_hugepage/enabled
>>> -     */
>>> -    settings.thp_enabled = THP_NEVER;
>>> +    settings.thp_enabled = THP_ALWAYS;
>>
>>
>> Would MADVISE mode also work here? If we don't set MADV_HUGEPAGE, then
>> khugepaged should be excluded, correct?
> 
> I tried this, but some test cases failed. As I replied to Barry, it's
> because some tests previously set MADV_NOHUGEPAGE, and now there is no
> way to clear the MADV_NOHUGEPAGE flag except by setting MADV_HUGEPAGE.

Okay, can you add that detail to the patch description? I suspect we 
really want a way to undo what MADV_HUGEPAGE/MADV_NOHUGEPAGE did (if 
only naming weren't so complicated: MADV_DEFAULT_HUGEPAGE, hmmmm).

-- 
Cheers,

David / dhildenb

Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/6/24 16:29, David Hildenbrand wrote:
> On 24.06.25 03:48, Baolin Wang wrote:
>>
>>
>> On 2025/6/23 21:54, David Hildenbrand wrote:
>>>
>>>> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/
>>>> selftests/mm/khugepaged.c
>>>> index 4341ce6b3b38..85bfff53dba6 100644
>>>> --- a/tools/testing/selftests/mm/khugepaged.c
>>>> +++ b/tools/testing/selftests/mm/khugepaged.c
>>>> @@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg,
>>>> char *p, int nr_hpages,
>>>>        printf("%s...", msg);
>>>> -    /*
>>>> -     * Prevent khugepaged interference and tests that MADV_COLLAPSE
>>>> -     * ignores /sys/kernel/mm/transparent_hugepage/enabled
>>>> -     */
>>>> -    settings.thp_enabled = THP_NEVER;
>>>> +    settings.thp_enabled = THP_ALWAYS;
>>>
>>>
>>> Would MADVISE mode also work here? If we don't set MADV_HUGEPAGE, then
>>> khugepaged should be excluded, correct?
>>
>> I tried this, but some test cases failed. As I replied to Barry, it's
>> because some tests previously set MADV_NOHUGEPAGE, and now there is no
>> way to clear the MADV_NOHUGEPAGE flag except by setting MADV_HUGEPAGE.
> 
> Okay, can you add that detail to the patch description. 

Sure. Will do.

> I suspect we 
> really want a way to undo what MADV_NOHUGEPAGE/MADV_NOHUGEPAGE did (if 
> only naming wouldn't be complicated: MADV_DEFAULT_HUGEPAGE, hmmmm).
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Barry Song 3 months, 2 weeks ago
On Mon, Jun 23, 2025 at 8:28 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system.
>
> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
> the system-wide Anon THP sysfs settings, which means that even though we have
> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
> into a Anon THP. This violates the rule we have agreed upon: never means never.
>

Should we update the man page for MADV_COLLAPSE?
https://man7.org/linux/man-pages/man2/madvise.2.html

              MADV_COLLAPSE is independent of any sysfs (see sysfs(5))
              setting under /sys/kernel/mm/transparent_hugepage, both in
              terms of determining THP eligibility, and allocation
              semantics.  See Linux kernel source file
              Documentation/admin-guide/mm/transhuge.rst for more
              information.  MADV_COLLAPSE also ignores huge= tmpfs mount
              when operating on tmpfs files.  Allocation for the new
              hugepage may enter direct reclaim and/or compaction,
              regardless of VMA flags (though VM_NOHUGEPAGE is still
              respected).

So this effectively changes the uABI, right?

> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
> comments:
>
> "
> /*
>  * If we are here, we've succeeded in replacing all the native pages
>  * in the page cache with a single hugepage. If a mm were to fault-in
>  * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>  * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>  * analogously elide sysfs THP settings here.
>  */
> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> "
>
> Another rule for madvise, referring to David's suggestion: “allowing for
> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>
> To address this issue, the current strategy should be:
>
> If no hugepage modes are enabled for the desired orders, nor can we enable them
> by inheriting from a 'global' enabled setting - then it must be the case that
> all desired orders either specify or inherit 'NEVER' - and we must abort.
>
> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
> THP.

It’s a bit odd that the old test case expects collapsing to succeed even when
we’ve set it to ‘never’. Setting it to ‘always’ doesn’t seem to test anything
as a counterpart.

I assume the goal is to test that setting it to ‘never’ prevents collapsing?

>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

Thanks
Barry
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/6/23 19:08, Barry Song wrote:
> On Mon, Jun 23, 2025 at 8:28 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system.
>>
>> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
>> the system-wide Anon THP sysfs settings, which means that even though we have
>> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
>> into a Anon THP. This violates the rule we have agreed upon: never means never.
>>
> 
> Should we update the man page for madv_collapse ?
> https://man7.org/linux/man-pages/man2/madvise.2.html
> 
>                MADV_COLLAPSE is independent of any sysfs (see sysfs(5))
>                setting under /sys/kernel/mm/transparent_hugepage, both in
>                terms of determining THP eligibility, and allocation
>                semantics.  See Linux kernel source file
>                Documentation/admin-guide/mm/transhuge.rst for more
>                information.  MADV_COLLAPSE also ignores huge= tmpfs mount
>                when operating on tmpfs files.  Allocation for the new
>                hugepage may enter direct reclaim and/or compaction,
>                regardless of VMA flags (though VM_NOHUGEPAGE is still
>                respected).
> 
> So this effectively changes the uABI, right?

Good point. Will update the man page.

>> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
>> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
>> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
>> comments:
>>
>> "
>> /*
>>   * If we are here, we've succeeded in replacing all the native pages
>>   * in the page cache with a single hugepage. If a mm were to fault-in
>>   * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>>   * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>>   * analogously elide sysfs THP settings here.
>>   */
>> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
>> "
>>
>> Another rule for madvise, referring to David's suggestion: “allowing for
>> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>>
>> To address this issue, the current strategy should be:
>>
>> If no hugepage modes are enabled for the desired orders, nor can we enable them
>> by inheriting from a 'global' enabled setting - then it must be the case that
>> all desired orders either specify or inherit 'NEVER' - and we must abort.
>>
>> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
>> THP.
> 
> It’s a bit odd that the old test case expects collapsing to succeed
> even when we’ve set it
> to ‘never’.
> Setting it to ‘always’ doesn’t seem to test anything as a counterpart.
> 
> I assume the goal is to test that setting it to ‘never’ prevents collapsing?

The original test sets THP_NEVER to keep khugepaged out of the way, so
that only madvise_collapse() can perform the THP collapse. That is
exactly the behaviour this patchset changes: madvise_collapse() is now
also prevented from performing THP collapse when the system-wide THP
sysfs settings are disabled.

Therefore, it should be changed to THP_ALWAYS here to allow 
madvise_collapse() to perform THP collapse.
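
For anyone who wants to confirm the active system-wide mode from user
space before attempting MADV_COLLAPSE, here is a minimal sketch. It
assumes the usual sysfs convention that the active value in
/sys/kernel/mm/transparent_hugepage/enabled is the one in square
brackets. thp_active_mode() is a hypothetical helper written for this
mail, not something from the selftests; reading the file itself is left
to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) token from a sysfs-style mode string such
 * as "always madvise [never]" into out.
 * Returns 0 on success, -1 if no bracketed token fits into out.
 * Hypothetical helper, for illustration only.
 */
static int thp_active_mode(const char *buf, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(buf && out);

	open = strchr(buf, '[');
	if (!open)
		return -1;

	close = strchr(open, ']');
	if (!close)
		return -1;

	/* Length of the text strictly between '[' and ']'. */
	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outlen)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

With this patch applied, a caller that sees "never" here (and for all
per-size settings) should expect MADV_COLLAPSE on anonymous memory to
be refused.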

Of course, the current logic cannot completely disable khugepaged, but I
haven't found a better way to modify it. Changing to MADVISE mode, as
David suggested, would cause some test cases to fail, because some tests
previously set MADV_NOHUGEPAGE, and currently the only way to clear the
MADV_NOHUGEPAGE flag is to set MADV_HUGEPAGE. As a result, khugepaged
cannot be completely disabled in that mode either.

So I think we should introduce a new method to clear the MADV_NOHUGEPAGE
flag without setting MADV_HUGEPAGE in the future.
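
The flag transitions behind this limitation can be sketched as a small
user-space model. This is an illustration under stated assumptions, not
the kernel's code: model_hugepage_madvise() and the MODEL_* constants
are hypothetical names with made-up values, and only the two advice
values discussed above are modelled.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel defines its own. */
#define MODEL_VM_HUGEPAGE	(1UL << 0)
#define MODEL_VM_NOHUGEPAGE	(1UL << 1)

enum model_advice {
	MODEL_MADV_HUGEPAGE,
	MODEL_MADV_NOHUGEPAGE,
};

/* Return the VMA flags after applying one piece of advice. */
static unsigned long model_hugepage_madvise(unsigned long vm_flags,
					    enum model_advice advice)
{
	const unsigned long both = MODEL_VM_HUGEPAGE | MODEL_VM_NOHUGEPAGE;

	switch (advice) {
	case MODEL_MADV_HUGEPAGE:
		/* Clearing NOHUGEPAGE always comes with setting HUGEPAGE. */
		vm_flags &= ~MODEL_VM_NOHUGEPAGE;
		vm_flags |= MODEL_VM_HUGEPAGE;
		break;
	case MODEL_MADV_NOHUGEPAGE:
		vm_flags &= ~MODEL_VM_HUGEPAGE;
		vm_flags |= MODEL_VM_NOHUGEPAGE;
		break;
	}

	/* The two flags are never both set after either advice. */
	assert((vm_flags & both) != both);
	return vm_flags;
}
```

In this model there is no transition from the NOHUGEPAGE state back to
"neither flag set": the only way out also sets the HUGEPAGE bit, which
makes the VMA eligible for khugepaged again in madvise mode.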
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Lorenzo Stoakes 3 months, 2 weeks ago
On Mon, Jun 23, 2025 at 04:28:08PM +0800, Baolin Wang wrote:
> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system.
>
> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
> the system-wide Anon THP sysfs settings, which means that even though we have
> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
> into a Anon THP. This violates the rule we have agreed upon: never means never.
>
> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
> comments:
>
> "
> /*
>  * If we are here, we've succeeded in replacing all the native pages
>  * in the page cache with a single hugepage. If a mm were to fault-in
>  * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>  * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>  * analogously elide sysfs THP settings here.
>  */
> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> "
>
> Another rule for madvise, referring to David's suggestion: “allowing for
> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>
> To address this issue, the current strategy should be:
>
> If no hugepage modes are enabled for the desired orders, nor can we enable them
> by inheriting from a 'global' enabled setting - then it must be the case that
> all desired orders either specify or inherit 'NEVER' - and we must abort.
>
> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
> THP.

Thanks! Sounds good.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Appreciate it though I'm not so bothered about attribution :) but just to say,
of course the 'never' stuff is David's idea (and a good one!) :)

> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>  tools/testing/selftests/mm/khugepaged.c |  6 +--
>  2 files changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4d5bb67dc4ec..ab70ca4e704b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -267,6 +267,42 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long tva_flags,
>  					 unsigned long orders);
>
> +/* Strictly mask requested anonymous orders according to sysfs settings. */
> +static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
> +		unsigned long tva_flags, unsigned long orders)
> +{
> +	const unsigned long always = READ_ONCE(huge_anon_orders_always);
> +	const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
> +	const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
> +	const unsigned long never = ~(always | madvise | inherit);
> +	const bool inherit_never = !hugepage_global_enabled();
> +
> +	/* Disallow orders that are set to NEVER directly ... */
> +	orders &= ~never;
> +
> +	/* ... or through inheritance (global == NEVER). */
> +	if (inherit_never)
> +		orders &= ~inherit;
> +
> +	/*
> +	 * Otherwise, we only enforce sysfs settings if asked. In addition,
> +	 * if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
> +	 * is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
> +	 * set.
> +	 */
> +	if (!(tva_flags & TVA_ENFORCE_SYSFS))
> +		return orders;
> +
> +	/* We already excluded never inherit above. */
> +	if (vm_flags & VM_HUGEPAGE)
> +		return orders & (always | madvise | inherit);
> +
> +	if (hugepage_global_always())
> +		return orders & (always | inherit);
> +
> +	return orders & always;
> +}
> +
>  /**
>   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>   * @vma:  the vm area to check
> @@ -289,19 +325,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  				       unsigned long orders)
>  {
>  	/* Optimization to check if required orders are enabled early. */
> -	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> -		unsigned long mask = READ_ONCE(huge_anon_orders_always);
> -
> -		if (vm_flags & VM_HUGEPAGE)
> -			mask |= READ_ONCE(huge_anon_orders_madvise);
> -		if (hugepage_global_always() ||
> -		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> -			mask |= READ_ONCE(huge_anon_orders_inherit);
> -
> -		orders &= mask;
> -		if (!orders)
> -			return 0;
> -	}
> +	if (vma_is_anonymous(vma))
> +		orders = __thp_mask_anon_orders(vm_flags, tva_flags, orders);
>
>  	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>  }
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 4341ce6b3b38..85bfff53dba6 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
>
>  	printf("%s...", msg);
>
> -	/*
> -	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
> -	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
> -	 */
> -	settings.thp_enabled = THP_NEVER;
> +	settings.thp_enabled = THP_ALWAYS;

Good spot!

>  	settings.shmem_enabled = SHMEM_NEVER;
>  	thp_push_settings(&settings);
>
> --
> 2.43.5
>
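
For readers who want to check the masking rules in isolation, here is a
minimal user-space model of the __thp_mask_anon_orders() helper quoted
above. It is a sketch, not kernel code: struct thp_sysfs,
model_mask_anon_orders() and the MODEL_* flag values are assumptions
made for illustration, and the globals that the patch reads with
READ_ONCE() are replaced by plain struct fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel defines its own. */
#define MODEL_VM_HUGEPAGE	(1UL << 0)
#define MODEL_TVA_ENFORCE_SYSFS	(1UL << 0)

/* Hypothetical snapshot of the per-order sysfs bitmaps and global mode. */
struct thp_sysfs {
	unsigned long always;	/* orders set to "always" */
	unsigned long madvise;	/* orders set to "madvise" */
	unsigned long inherit;	/* orders set to "inherit" */
	bool global_always;	/* top-level enabled is "always" */
	bool global_madvise;	/* top-level enabled is "madvise" */
};

static unsigned long model_mask_anon_orders(const struct thp_sysfs *s,
					    unsigned long vm_flags,
					    unsigned long tva_flags,
					    unsigned long orders)
{
	unsigned long never;
	bool inherit_never;

	assert(s);
	never = ~(s->always | s->madvise | s->inherit);
	inherit_never = !(s->global_always || s->global_madvise);

	/* Orders set to NEVER directly are always dropped ... */
	orders &= ~never;

	/* ... as are orders that inherit a global NEVER. */
	if (inherit_never)
		orders &= ~s->inherit;

	/* Without enforcement, only the NEVER rules above apply. */
	if (!(tva_flags & MODEL_TVA_ENFORCE_SYSFS))
		return orders;

	if (vm_flags & MODEL_VM_HUGEPAGE)
		return orders & (s->always | s->madvise | s->inherit);

	if (s->global_always)
		return orders & (s->always | s->inherit);

	return orders & s->always;
}
```

The cases that matter for this series are the non-enforcing ones: an
order set to "madvise" survives even without the hugepage flag on the
VMA, while an order that is "never" (directly or through inheritance)
is masked out regardless of the caller's flags.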
Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/6/23 18:26, Lorenzo Stoakes wrote:
> On Mon, Jun 23, 2025 at 04:28:08PM +0800, Baolin Wang wrote:
>> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system.
>>
>> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
>> the system-wide Anon THP sysfs settings, which means that even though we have
>> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
>> into a Anon THP. This violates the rule we have agreed upon: never means never.
>>
>> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
>> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
>> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
>> comments:
>>
>> "
>> /*
>>   * If we are here, we've succeeded in replacing all the native pages
>>   * in the page cache with a single hugepage. If a mm were to fault-in
>>   * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>>   * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>>   * analogously elide sysfs THP settings here.
>>   */
>> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
>> "
>>
>> Another rule for madvise, referring to David's suggestion: “allowing for
>> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>>
>> To address this issue, the current strategy should be:
>>
>> If no hugepage modes are enabled for the desired orders, nor can we enable them
>> by inheriting from a 'global' enabled setting - then it must be the case that
>> all desired orders either specify or inherit 'NEVER' - and we must abort.
>>
>> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
>> THP.
> 
> Thanks! Sounds good.
>>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> Appreciate it though I'm not so bothered about attribution :) but just to say,
> of course the 'never' stuff is David's idea (and a good one!) :)

Yes, I should also add:

Suggested-by: David Hildenbrand <david@redhat.com>

>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> LGTM so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks.