include/linux/huge_mm.h | 23 +++++++++++++++++++---- mm/huge_memory.c | 2 +- mm/shmem.c | 6 +++--- 3 files changed, 23 insertions(+), 8 deletions(-)
As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
the system-wide anon/shmem THP sysfs settings, which means that even though
we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
attempt to collapse into a anon/shmem THP. This violates the rule we have
agreed upon: never means never. This patch set will address this issue.
[1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
Changes from v1:
- Update the commit message, per Zi.
- Add Zi's reviewed tag. Thanks.
- Update the shmem logic.
Baolin Wang (2):
mm: huge_memory: disallow hugepages if the system-wide THP sysfs
settings are disabled
mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
settings are disabled
include/linux/huge_mm.h | 23 +++++++++++++++++++----
mm/huge_memory.c | 2 +-
mm/shmem.c | 6 +++---
3 files changed, 23 insertions(+), 8 deletions(-)
--
2.43.5
On 05/06/2025 09:00, Baolin Wang wrote: > As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore > the system-wide anon/shmem THP sysfs settings, which means that even though > we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still > attempt to collapse into a anon/shmem THP. This violates the rule we have > agreed upon: never means never. This patch set will address this issue. Hi Baolin, I know never means never, but I also thought that the per-size toggles had priority over the system ones. This was discussed in [1] as well. My understanding with these patches is that if we have: [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled always madvise [never] [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled always inherit [madvise] never Than without these patches we get a hugepage when we do MADV_HUGEPAGE, but with these we won't get a hugepage anymore eventhough hugepages-2048kB/enabled is set to madvise? I know this isn't ABI, but this would break existing expectations. (For e.g. we have certain 64K page size arm machines with global enabled = never and 2M = madvise, and we want 2M hugepages to fault at madvise). If the whole thing was being implemented from scratch, we should have definitely done it this way, but this can give a people a nasty surprise when they upgrade the kernel and suddenly stop getting hugepages. [1] https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/ Thanks, Usama
On Fri, Jun 13, 2025 at 03:23:19PM +0100, Usama Arif wrote:
>
>
> On 05/06/2025 09:00, Baolin Wang wrote:
> > As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> > the system-wide anon/shmem THP sysfs settings, which means that even though
> > we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> > attempt to collapse into a anon/shmem THP. This violates the rule we have
> > agreed upon: never means never. This patch set will address this issue.
>
> Hi Baolin,
>
> I know never means never, but I also thought that the per-size toggles had
> priority over the system ones. This was discussed in [1] as well.
>
> My understanding with these patches is that if we have:
>
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> always inherit [madvise] never
>
> Than without these patches we get a hugepage when we do MADV_HUGEPAGE, but with
> these we won't get a hugepage anymore eventhough hugepages-2048kB/enabled is set
> to madvise?
This isn't correct, madvise at a specific pagesize will still be permitted for
MADV_COLLAPSE.
In current contender for this patch:
/* Strictly mask requested anonymous orders according to sysfs settings. */
static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
unsigned long tva_flags, unsigned long orders)
{
const unsigned long always = READ_ONCE(huge_anon_orders_always);
const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);;
const unsigned long never = ~(always | madvise | inherit);
Note that madvise is considered here.
const bool inherit_never = !hugepage_global_enabled();
/* Disallow orders that are set to NEVER directly ... */
orders &= ~never;
/* ... or through inheritance (global == NEVER). */
if (inherit_never)
orders &= ~inherit;
/*
* Otherwise, we only enforce sysfs settings if asked. In addition,
* if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
* is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
* set.
*/
if (!(tva_flags & TVA_ENFORCE_SYSFS))
return orders;
And then if !TVA_ENFORCE_SYSFS (e.g. MADV_COLLAPSE case), an madvise should
succeed fine.
>
> I know this isn't ABI, but this would break existing expectations.
> (For e.g. we have certain 64K page size arm machines with global enabled = never and
> 2M = madvise, and we want 2M hugepages to fault at madvise).
> If the whole thing was being implemented from scratch, we should have definitely
> done it this way, but this can give a people a nasty surprise when they upgrade
> the kernel and suddenly stop getting hugepages.
>
> [1] https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
>
> Thanks,
> Usama
On 13/06/2025 15:29, Lorenzo Stoakes wrote:
> On Fri, Jun 13, 2025 at 03:23:19PM +0100, Usama Arif wrote:
>>
>>
>> On 05/06/2025 09:00, Baolin Wang wrote:
>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>> agreed upon: never means never. This patch set will address this issue.
>>
>> Hi Baolin,
>>
>> I know never means never, but I also thought that the per-size toggles had
>> priority over the system ones. This was discussed in [1] as well.
>>
>> My understanding with these patches is that if we have:
>>
>> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
>> always madvise [never]
>> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> always inherit [madvise] never
>>
>> Than without these patches we get a hugepage when we do MADV_HUGEPAGE, but with
>> these we won't get a hugepage anymore eventhough hugepages-2048kB/enabled is set
>> to madvise?
>
> This isn't correct, madvise at a specific pagesize will still be permitted for
> MADV_COLLAPSE.
>
> In current contender for this patch:
>
> /* Strictly mask requested anonymous orders according to sysfs settings. */
> static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
> unsigned long tva_flags, unsigned long orders)
> {
> const unsigned long always = READ_ONCE(huge_anon_orders_always);
> const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
> const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);;
> const unsigned long never = ~(always | madvise | inherit);
>
> Note that madvise is considered here.
>
Ah ok, Thanks for clearing that! I was reviewing the original patch in [1] but I
see this version in the replies.
I wish this function was simpler :) or maybe its me that takes so much time
to figure out if the order will be set or not by the end of the function.
[1] https://lore.kernel.org/all/8eefb0809c598fadaa4a022634fba5689a4f3257.1749109709.git.baolin.wang@linux.alibaba.com/
Thanks!
Usama
On Fri, Jun 13, 2025 at 03:39:33PM +0100, Usama Arif wrote:
>
>
> On 13/06/2025 15:29, Lorenzo Stoakes wrote:
> > On Fri, Jun 13, 2025 at 03:23:19PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 05/06/2025 09:00, Baolin Wang wrote:
> >>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> >>> the system-wide anon/shmem THP sysfs settings, which means that even though
> >>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> >>> attempt to collapse into a anon/shmem THP. This violates the rule we have
> >>> agreed upon: never means never. This patch set will address this issue.
> >>
> >> Hi Baolin,
> >>
> >> I know never means never, but I also thought that the per-size toggles had
> >> priority over the system ones. This was discussed in [1] as well.
> >>
> >> My understanding with these patches is that if we have:
> >>
> >> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
> >> always madvise [never]
> >> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >> always inherit [madvise] never
> >>
> >> Than without these patches we get a hugepage when we do MADV_HUGEPAGE, but with
> >> these we won't get a hugepage anymore eventhough hugepages-2048kB/enabled is set
> >> to madvise?
> >
> > This isn't correct, madvise at a specific pagesize will still be permitted for
> > MADV_COLLAPSE.
> >
> > In current contender for this patch:
> >
> > /* Strictly mask requested anonymous orders according to sysfs settings. */
> > static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
> > unsigned long tva_flags, unsigned long orders)
> > {
> > const unsigned long always = READ_ONCE(huge_anon_orders_always);
> > const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
> > const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);;
> > const unsigned long never = ~(always | madvise | inherit);
> >
> > Note that madvise is considered here.
> >
>
> Ah ok, Thanks for clearing that! I was reviewing the original patch in [1] but I
> see this version in the replies.
>
> I wish this function was simpler :) or maybe its me that takes so much time
> to figure out if the order will be set or not by the end of the function.
Couldn't agree more... this whole thing needs major work, it's massively confusing
>
> [1] https://lore.kernel.org/all/8eefb0809c598fadaa4a022634fba5689a4f3257.1749109709.git.baolin.wang@linux.alibaba.com/
>
> Thanks!
> Usama
>
Before I get into technical criticism, to be clear - thank you very much for doing this :) I'm just getting into details as to the implementation, but am a fan of this change and consider it important. On Thu, Jun 05, 2025 at 04:00:57PM +0800, Baolin Wang wrote: > As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore > the system-wide anon/shmem THP sysfs settings, which means that even though > we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still > attempt to collapse into a anon/shmem THP. This violates the rule we have > agreed upon: never means never. This patch set will address this issue. Hm this cover letter could be expanded upon quite a bit - you are doing a lot here and it's not only MADV_COLLAPSE, more a general change. I'd mention that, even when TVA_ENFORCE_SYSFS is not set, callers checking THP order validity will not be able to specify THP orders that are either specifically marked as 'never' or set to 'inherit' and the global hugepage mode is 'never'. Then say something like 'importantly, this changes alters the madvise(..., MADV_COLLAPSE) call, which previously would collapse ranges into huge pages even if THP was set to never. This corrects this behaviour'. I suspect you are unable to write sensible tests here given the need to manipulate sysfs (though perhaps worth quickly looking at tools/testing/selftests/mm/khugepaged.c, transhuge-stress.c, run_vmtests.sh to see), but it'd be at least useful for you to give details here of how you have tested this and ensured it functions correctly. It might also be worth giving a quick justification, i.e. 'system administrators who disabled THP everywhere must indeed very much not want THP to be used for whatever reason - having individual programs being able to quietly override this is very surprising and likely to cause headaches for those who desire this not to happen on their systems'. > > [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/ > > Changes from v1: > - Update the commit message, per Zi. > - Add Zi's reviewed tag. Thanks. > - Update the shmem logic. > > Baolin Wang (2): > mm: huge_memory: disallow hugepages if the system-wide THP sysfs > settings are disabled > mm: shmem: disallow hugepages if the system-wide shmem THP sysfs > settings are disabled > > include/linux/huge_mm.h | 23 +++++++++++++++++++---- > mm/huge_memory.c | 2 +- > mm/shmem.c | 6 +++--- > 3 files changed, 23 insertions(+), 8 deletions(-) > > -- > 2.43.5 > Thanks, Lorenzo
On 2025/6/7 20:28, Lorenzo Stoakes wrote: > Before I get into technical criticism, to be clear - thank you very much > for doing this :) I'm just getting into details as to the implementation, > but am a fan of this change and consider it important. > > On Thu, Jun 05, 2025 at 04:00:57PM +0800, Baolin Wang wrote: >> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore >> the system-wide anon/shmem THP sysfs settings, which means that even though >> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still >> attempt to collapse into a anon/shmem THP. This violates the rule we have >> agreed upon: never means never. This patch set will address this issue. > > Hm this cover letter could be expanded upon quite a bit - you are doing a > lot here and it's not only MADV_COLLAPSE, more a general change. > > I'd mention that, even when TVA_ENFORCE_SYSFS is not set, callers checking > THP order validity will not be able to specify THP orders that are either > specifically marked as 'never' or set to 'inherit' and the global hugepage > mode is 'never'. > > Then say something like 'importantly, this changes alters the madvise(..., > MADV_COLLAPSE) call, which previously would collapse ranges into huge pages > even if THP was set to never. This corrects this behaviour'. > > I suspect you are unable to write sensible tests here given the need to > manipulate sysfs (though perhaps worth quickly looking at > tools/testing/selftests/mm/khugepaged.c, transhuge-stress.c, run_vmtests.sh > to see), but it'd be at least useful for you to give details here of how > you have tested this and ensured it functions correctly. > > It might also be worth giving a quick justification, i.e. 'system > administrators who disabled THP everywhere must indeed very much not want > THP to be used for whatever reason - having individual programs being able > to quietly override this is very surprising and likely to cause headaches > for those who desire this not to happen on their systems'. > Ah, missed this comment. Good suggestion, I will update the cover letter. Thanks.
© 2016 - 2025 Red Hat, Inc.