[v1] fix MADV_COLLAPSE issue if THP settings are disabled

[PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Baolin Wang 8 months, 2 weeks ago

As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
the system-wide anon/shmem THP sysfs settings, which means that even though
we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
attempt to collapse into a anon/shmem THP. This violates the rule we have
agreed upon: never means never. This patch set will address this issue.

[1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/

Baolin Wang (2):
  mm: huge_memory: disallow hugepages if the system-wide THP sysfs
    settings are disabled
  mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
    settings are disabled

 include/linux/huge_mm.h | 23 +++++++++++++++++++----
 mm/huge_memory.c        |  2 +-
 mm/shmem.c              | 12 ++++++------
 3 files changed, 26 insertions(+), 11 deletions(-)

-- 
2.43.5

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Ryan Roberts 8 months, 1 week ago

On 29/05/2025 09:23, Baolin Wang wrote:
> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> the system-wide anon/shmem THP sysfs settings, which means that even though
> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> attempt to collapse into a anon/shmem THP. This violates the rule we have
> agreed upon: never means never. This patch set will address this issue.

This is a drive-by comment from me without having the previous context, but...

Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
user-initiated, synchonous request to use huge pages for a range of memory.
There is nothing *transparent* about it, it just happens to be implemented using
the same logic that THP uses.

I always thought this was a deliberate design decision.

Thanks,
Ryan

> 
> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
> 
> Baolin Wang (2):
>   mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>     settings are disabled
>   mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>     settings are disabled
> 
>  include/linux/huge_mm.h | 23 +++++++++++++++++++----
>  mm/huge_memory.c        |  2 +-
>  mm/shmem.c              | 12 ++++++------
>  3 files changed, 26 insertions(+), 11 deletions(-)
>

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by David Hildenbrand 8 months, 1 week ago

On 30.05.25 10:04, Ryan Roberts wrote:
> On 29/05/2025 09:23, Baolin Wang wrote:
>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>> the system-wide anon/shmem THP sysfs settings, which means that even though
>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>> agreed upon: never means never. This patch set will address this issue.
> 
> This is a drive-by comment from me without having the previous context, but...
> 
> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
> user-initiated, synchonous request to use huge pages for a range of memory.
> There is nothing *transparent* about it, it just happens to be implemented using
> the same logic that THP uses.
> 
> I always thought this was a deliberate design decision.

If the admin said "never", then why should a user be able to overwrite that?

The design decision I recall is that if VM_NOHUGEPAGE is set, we'll 
ignore that. Because that was set by the app itself (MADV_NOHUEPAGE).

-- 
Cheers,

David / dhildenb

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Ryan Roberts 8 months, 1 week ago

On 30/05/2025 09:44, David Hildenbrand wrote:
> On 30.05.25 10:04, Ryan Roberts wrote:
>> On 29/05/2025 09:23, Baolin Wang wrote:
>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>> agreed upon: never means never. This patch set will address this issue.
>>
>> This is a drive-by comment from me without having the previous context, but...
>>
>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>> user-initiated, synchonous request to use huge pages for a range of memory.
>> There is nothing *transparent* about it, it just happens to be implemented using
>> the same logic that THP uses.
>>
>> I always thought this was a deliberate design decision.
> 
> If the admin said "never", then why should a user be able to overwrite that?

Well my interpretation would be that the admin is saying never *transparently*
give anyone any hugepages; on balance it does more harm than good for my
workloads. The toggle is called transparent_hugepage/enabled, after all.

Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
opportune moment in time, presumably because the user knows that the region
*will* benefit and because that point in the execution is not sensitive to latency.

I see them as logically separate.

> 
> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore that.
> Because that was set by the app itself (MADV_NOHUEPAGE).

Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
don't want the risk of latency spikes and memory bloat that THP can cause". Not
"ignore my explicit requests to MADV_COLLAPSE".

But if that descision was already taken and that's the current behavior then I
agree we have an inconsistency with respect to the sysfs control.

Perhaps we should be guided by real world usage - AIUI there is a cloud that
disables THP at system level today (Google?). Is there any concern that there
are workloads in such environments that are using MADV_COLLAPSE today that would
then see a performance drop?

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by David Hildenbrand 8 months, 1 week ago

On 30.05.25 10:59, Ryan Roberts wrote:
> On 30/05/2025 09:44, David Hildenbrand wrote:
>> On 30.05.25 10:04, Ryan Roberts wrote:
>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>> agreed upon: never means never. This patch set will address this issue.
>>>
>>> This is a drive-by comment from me without having the previous context, but...
>>>
>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>> There is nothing *transparent* about it, it just happens to be implemented using
>>> the same logic that THP uses.
>>>
>>> I always thought this was a deliberate design decision.
>>
>> If the admin said "never", then why should a user be able to overwrite that?
> 
> Well my interpretation would be that the admin is saying never *transparently*
> give anyone any hugepages; on balance it does more harm than good for my
> workloads. The toggle is called transparent_hugepage/enabled, after all.

I'd say it's "enabling transparent huge pages" not "transparently 
enabling huge pages". After all, these things are ... transparent huge 
pages.

But yeah, it's confusing.

> 
> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
> opportune moment in time, presumably because the user knows that the region
> *will* benefit and because that point in the execution is not sensitive to latency.

Not sure if MADV_HUGEPAGE is really *that* different.

> 
> I see them as logically separate.
> 
>>
>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore that.
>> Because that was set by the app itself (MADV_NOHUEPAGE).
> 
> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
> don't want the risk of latency spikes and memory bloat that THP can cause". Not
> "ignore my explicit requests to MADV_COLLAPSE".
> 
> But if that descision was already taken and that's the current behavior then I
> agree we have an inconsistency with respect to the sysfs control.
> 
> Perhaps we should be guided by real world usage - AIUI there is a cloud that
> disables THP at system level today (Google?).
The use case I am aware of for disabling it for debugging purposes. 
Saved us quite some headake in the past at customer sites for 
troubleshooting + workarounds ...


Let's take a look at the man page:

MADV_COLLAPSE is  independent  of  any  sysfs  (see  sysfs(5))  setting 
under  /sys/kernel/mm/transparent_hugepage, both in terms of determining 
THP eligibility, and allocation semantics.

I recall we discussed that it should ignore the max_ptes_none/swap/shared.

But "any" setting would include "enable" ...

-- 
Cheers,

David / dhildenb

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by David Hildenbrand 8 months, 1 week ago

On 30.05.25 11:10, David Hildenbrand wrote:
> On 30.05.25 10:59, Ryan Roberts wrote:
>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>>> agreed upon: never means never. This patch set will address this issue.
>>>>
>>>> This is a drive-by comment from me without having the previous context, but...
>>>>
>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>>> There is nothing *transparent* about it, it just happens to be implemented using
>>>> the same logic that THP uses.
>>>>
>>>> I always thought this was a deliberate design decision.
>>>
>>> If the admin said "never", then why should a user be able to overwrite that?
>>
>> Well my interpretation would be that the admin is saying never *transparently*
>> give anyone any hugepages; on balance it does more harm than good for my
>> workloads. The toggle is called transparent_hugepage/enabled, after all.
> 
> I'd say it's "enabling transparent huge pages" not "transparently
> enabling huge pages". After all, these things are ... transparent huge
> pages.
> 
> But yeah, it's confusing.
> 
>>
>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>> opportune moment in time, presumably because the user knows that the region
>> *will* benefit and because that point in the execution is not sensitive to latency.
> 
> Not sure if MADV_HUGEPAGE is really *that* different.
> 
>>
>> I see them as logically separate.
>>
>>>
>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore that.
>>> Because that was set by the app itself (MADV_NOHUEPAGE).
>>
>> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
>> don't want the risk of latency spikes and memory bloat that THP can cause". Not
>> "ignore my explicit requests to MADV_COLLAPSE".
>>
>> But if that descision was already taken and that's the current behavior then I
>> agree we have an inconsistency with respect to the sysfs control.
>>
>> Perhaps we should be guided by real world usage - AIUI there is a cloud that
>> disables THP at system level today (Google?).
> The use case I am aware of for disabling it for debugging purposes.
> Saved us quite some headake in the past at customer sites for
> troubleshooting + workarounds ...
> 
> 
> Let's take a look at the man page:
> 
> MADV_COLLAPSE is  independent  of  any  sysfs  (see  sysfs(5))  setting
> under  /sys/kernel/mm/transparent_hugepage, both in terms of determining
> THP eligibility, and allocation semantics.
> 
> I recall we discussed that it should ignore the max_ptes_none/swap/shared.
> 
> But "any" setting would include "enable" ...

It kind-of contradicts the linked 
Documentation/admin-guide/mm/transhuge.rst, where we have this 
*beautiful* comment

"Transparent Hugepage Support for anonymous memory can be entirely 
disable (mostly for debugging purposes".

I mean, "entirely" is also pretty clear to me.

I would assume that the man page of MADV_COLLAPSE should have talked 
about ignoring *khugepaged* toggles (max_ptes_none ...), at least that's 
what I recall from the discussions back then.

-- 
Cheers,

David / dhildenb

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Baolin Wang 8 months, 1 week ago


On 2025/5/30 17:16, David Hildenbrand wrote:
> On 30.05.25 11:10, David Hildenbrand wrote:
>> On 30.05.25 10:59, Ryan Roberts wrote:
>>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will 
>>>>>> ignore
>>>>>> the system-wide anon/shmem THP sysfs settings, which means that 
>>>>>> even though
>>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE 
>>>>>> will still
>>>>>> attempt to collapse into a anon/shmem THP. This violates the rule 
>>>>>> we have
>>>>>> agreed upon: never means never. This patch set will address this 
>>>>>> issue.
>>>>>
>>>>> This is a drive-by comment from me without having the previous 
>>>>> context, but...
>>>>>
>>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a 
>>>>> deliberate
>>>>> user-initiated, synchonous request to use huge pages for a range of 
>>>>> memory.
>>>>> There is nothing *transparent* about it, it just happens to be 
>>>>> implemented using
>>>>> the same logic that THP uses.
>>>>>
>>>>> I always thought this was a deliberate design decision.
>>>>
>>>> If the admin said "never", then why should a user be able to 
>>>> overwrite that?
>>>
>>> Well my interpretation would be that the admin is saying never 
>>> *transparently*
>>> give anyone any hugepages; on balance it does more harm than good for my
>>> workloads. The toggle is called transparent_hugepage/enabled, after all.
>>
>> I'd say it's "enabling transparent huge pages" not "transparently
>> enabling huge pages". After all, these things are ... transparent huge
>> pages.
>>
>> But yeah, it's confusing.
>>
>>>
>>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>>> opportune moment in time, presumably because the user knows that the 
>>> region
>>> *will* benefit and because that point in the execution is not 
>>> sensitive to latency.
>>
>> Not sure if MADV_HUGEPAGE is really *that* different.
>>
>>>
>>> I see them as logically separate.
>>>
>>>>
>>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll 
>>>> ignore that.
>>>> Because that was set by the app itself (MADV_NOHUEPAGE).

IIUC, MADV_COLLAPSE does not ignore the VM_NOHUGEPAGE setting, if we set 
VM_NOHUGEPAGE, then MADV_COLLAPSE will not be allowed to collapse a THP. 
See:
__thp_vma_allowable_orders() ---> vma_thp_disabled()

>>> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE 
>>> means "I
>>> don't want the risk of latency spikes and memory bloat that THP can 
>>> cause". Not
>>> "ignore my explicit requests to MADV_COLLAPSE".
>>>
>>> But if that descision was already taken and that's the current 
>>> behavior then I
>>> agree we have an inconsistency with respect to the sysfs control.
>>>
>>> Perhaps we should be guided by real world usage - AIUI there is a 
>>> cloud that
>>> disables THP at system level today (Google?).
>> The use case I am aware of for disabling it for debugging purposes.
>> Saved us quite some headake in the past at customer sites for
>> troubleshooting + workarounds ...
>>
>>
>> Let's take a look at the man page:
>>
>> MADV_COLLAPSE is  independent  of  any  sysfs  (see  sysfs(5))  setting
>> under  /sys/kernel/mm/transparent_hugepage, both in terms of determining
>> THP eligibility, and allocation semantics.
>>
>> I recall we discussed that it should ignore the 
>> max_ptes_none/swap/shared.
>>
>> But "any" setting would include "enable" ...
> 
> It kind-of contradicts the linked 
> Documentation/admin-guide/mm/transhuge.rst, where we have this 
> *beautiful* comment
> 
> "Transparent Hugepage Support for anonymous memory can be entirely 
> disable (mostly for debugging purposes".
> 
> I mean, "entirely" is also pretty clear to me.

Yes, agree. We have encountered issues caused by THP in our Alibaba 
fleet. The quickest way to stop the bleeding was to disable THP. In such 
case, we do not expect MADV_HUGEPAGE to still collapse a THP.

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by David Hildenbrand 8 months, 1 week ago

On 30.05.25 11:52, Baolin Wang wrote:
> 
> 
> On 2025/5/30 17:16, David Hildenbrand wrote:
>> On 30.05.25 11:10, David Hildenbrand wrote:
>>> On 30.05.25 10:59, Ryan Roberts wrote:
>>>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will
>>>>>>> ignore
>>>>>>> the system-wide anon/shmem THP sysfs settings, which means that
>>>>>>> even though
>>>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE
>>>>>>> will still
>>>>>>> attempt to collapse into a anon/shmem THP. This violates the rule
>>>>>>> we have
>>>>>>> agreed upon: never means never. This patch set will address this
>>>>>>> issue.
>>>>>>
>>>>>> This is a drive-by comment from me without having the previous
>>>>>> context, but...
>>>>>>
>>>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a
>>>>>> deliberate
>>>>>> user-initiated, synchonous request to use huge pages for a range of
>>>>>> memory.
>>>>>> There is nothing *transparent* about it, it just happens to be
>>>>>> implemented using
>>>>>> the same logic that THP uses.
>>>>>>
>>>>>> I always thought this was a deliberate design decision.
>>>>>
>>>>> If the admin said "never", then why should a user be able to
>>>>> overwrite that?
>>>>
>>>> Well my interpretation would be that the admin is saying never
>>>> *transparently*
>>>> give anyone any hugepages; on balance it does more harm than good for my
>>>> workloads. The toggle is called transparent_hugepage/enabled, after all.
>>>
>>> I'd say it's "enabling transparent huge pages" not "transparently
>>> enabling huge pages". After all, these things are ... transparent huge
>>> pages.
>>>
>>> But yeah, it's confusing.
>>>
>>>>
>>>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>>>> opportune moment in time, presumably because the user knows that the
>>>> region
>>>> *will* benefit and because that point in the execution is not
>>>> sensitive to latency.
>>>
>>> Not sure if MADV_HUGEPAGE is really *that* different.
>>>
>>>>
>>>> I see them as logically separate.
>>>>
>>>>>
>>>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll
>>>>> ignore that.
>>>>> Because that was set by the app itself (MADV_NOHUEPAGE).
> 
> IIUC, MADV_COLLAPSE does not ignore the VM_NOHUGEPAGE setting, if we set
> VM_NOHUGEPAGE, then MADV_COLLAPSE will not be allowed to collapse a THP.
> See:
> __thp_vma_allowable_orders() ---> vma_thp_disabled()

Interesting, maybe I misremember things.

Maybe because process_madvise() could try MADV_COLLAPSE on a different 
process. And if that process as VM_NOHUGEPAGE set, it could be problematic.

-- 
Cheers,

David / dhildenb

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Lorenzo Stoakes 8 months, 1 week ago

On Fri, May 30, 2025 at 11:16:51AM +0200, David Hildenbrand wrote:
[snip]
> It kind-of contradicts the linked
> Documentation/admin-guide/mm/transhuge.rst, where we have this *beautiful*
> comment
>
> "Transparent Hugepage Support for anonymous memory can be entirely disable
> (mostly for debugging purposes".
>
> I mean, "entirely" is also pretty clear to me.
>
> I would assume that the man page of MADV_COLLAPSE should have talked about
> ignoring *khugepaged* toggles (max_ptes_none ...), at least that's what I
> recall from the discussions back then.

Sorry I don't want to turn this stuff into too much of a mega-thread but
just a small comment here - I think we should go and update the
documentation/man pages to be clearer and consistent.

There is enough confusion around this as it is...

>
> --
> Cheers,
>
> David / dhildenb
>

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Lorenzo Stoakes 8 months, 1 week ago

On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
> On 30.05.25 10:04, Ryan Roberts wrote:
> > On 29/05/2025 09:23, Baolin Wang wrote:
> > > As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> > > the system-wide anon/shmem THP sysfs settings, which means that even though
> > > we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> > > attempt to collapse into a anon/shmem THP. This violates the rule we have
> > > agreed upon: never means never. This patch set will address this issue.
> >
> > This is a drive-by comment from me without having the previous context, but...
> >
> > Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
> > user-initiated, synchonous request to use huge pages for a range of memory.
> > There is nothing *transparent* about it, it just happens to be implemented using
> > the same logic that THP uses.
> >
> > I always thought this was a deliberate design decision.
>
> If the admin said "never", then why should a user be able to overwrite that?
>
> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
> that. Because that was set by the app itself (MADV_NOHUEPAGE).
>

I'm with David on this one.

I think it's principal of least surprise - to me 'never' is pretty
emphatic, and keep in mind the other choices are 'always' and...  'madvise'
:) which explicitly is 'hey only do this if madvise tells you to'.

I'd be rather surprised if I hadn't set madvise and a user uses madvise to
in some fashion override the never.

I mean I think we all agree this interface is to use a technical term -
crap - and we need something a lot more fine-grained and smart, but I think
given the situation we're in we should make it at least as least surprising
as possible.

> --
> Cheers,
>
> David / dhildenb
>

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by David Hildenbrand 8 months, 1 week ago

On 30.05.25 10:47, Lorenzo Stoakes wrote:
> On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
>> On 30.05.25 10:04, Ryan Roberts wrote:
>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>> agreed upon: never means never. This patch set will address this issue.
>>>
>>> This is a drive-by comment from me without having the previous context, but...
>>>
>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>> There is nothing *transparent* about it, it just happens to be implemented using
>>> the same logic that THP uses.
>>>
>>> I always thought this was a deliberate design decision.
>>
>> If the admin said "never", then why should a user be able to overwrite that?
>>
>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
>> that. Because that was set by the app itself (MADV_NOHUEPAGE).
>>
> 
> I'm with David on this one.
> 
> I think it's principal of least surprise - to me 'never' is pretty
> emphatic, and keep in mind the other choices are 'always' and...  'madvise'
> :) which explicitly is 'hey only do this if madvise tells you to'.
> 
> I'd be rather surprised if I hadn't set madvise and a user uses madvise to
> in some fashion override the never.
> 
> I mean I think we all agree this interface is to use a technical term -
> crap - and we need something a lot more fine-grained and smart, but I think
> given the situation we're in we should make it at least as least surprising
> as possible.

Yes. If you configure "never" you are supposed to suffer, consistently.

-- 
Cheers,

David / dhildenb

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Ryan Roberts 8 months, 1 week ago

On 30/05/2025 09:52, David Hildenbrand wrote:
> On 30.05.25 10:47, Lorenzo Stoakes wrote:
>> On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>>> agreed upon: never means never. This patch set will address this issue.
>>>>
>>>> This is a drive-by comment from me without having the previous context, but...
>>>>
>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>>> There is nothing *transparent* about it, it just happens to be implemented
>>>> using
>>>> the same logic that THP uses.
>>>>
>>>> I always thought this was a deliberate design decision.
>>>
>>> If the admin said "never", then why should a user be able to overwrite that?
>>>
>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
>>> that. Because that was set by the app itself (MADV_NOHUEPAGE).
>>>
>>
>> I'm with David on this one.
>>
>> I think it's principal of least surprise - to me 'never' is pretty
>> emphatic, and keep in mind the other choices are 'always' and...  'madvise'
>> :) which explicitly is 'hey only do this if madvise tells you to'.

I think it's a bit reductive to suggest that enabled=madvise means all madvise
calls though. I don't think anyone would suggest MADV_DONTNEED should be ignored
if enabled=never. MADV_COLLAPSE just happens to be implemented on top of the THP
logic. But it's a different feature in my view.

>>
>> I'd be rather surprised if I hadn't set madvise and a user uses madvise to
>> in some fashion override the never.
>>
>> I mean I think we all agree this interface is to use a technical term -
>> crap - and we need something a lot more fine-grained and smart, 

Yes agreed there!

>> but I think
>> given the situation we're in we should make it at least as least surprising
>> as possible.

> 
> Yes. If you configure "never" you are supposed to suffer, consistently.
> 

OK fair enough. Just giving my 2 cents.

Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled

Posted by Lorenzo Stoakes 8 months, 1 week ago

On Fri, May 30, 2025 at 10:07:11AM +0100, Ryan Roberts wrote:
> On 30/05/2025 09:52, David Hildenbrand wrote:
> > On 30.05.25 10:47, Lorenzo Stoakes wrote:
> >> On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
> >>> On 30.05.25 10:04, Ryan Roberts wrote:
> >>>> On 29/05/2025 09:23, Baolin Wang wrote:
> >>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> >>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
> >>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> >>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
> >>>>> agreed upon: never means never. This patch set will address this issue.
> >>>>
> >>>> This is a drive-by comment from me without having the previous context, but...
> >>>>
> >>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
> >>>> user-initiated, synchonous request to use huge pages for a range of memory.
> >>>> There is nothing *transparent* about it, it just happens to be implemented
> >>>> using
> >>>> the same logic that THP uses.
> >>>>
> >>>> I always thought this was a deliberate design decision.
> >>>
> >>> If the admin said "never", then why should a user be able to overwrite that?
> >>>
> >>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
> >>> that. Because that was set by the app itself (MADV_NOHUEPAGE).
> >>>
> >>
> >> I'm with David on this one.
> >>
> >> I think it's principal of least surprise - to me 'never' is pretty
> >> emphatic, and keep in mind the other choices are 'always' and...  'madvise'
> >> :) which explicitly is 'hey only do this if madvise tells you to'.
>
> I think it's a bit reductive to suggest that enabled=madvise means all madvise
> calls though. I don't think anyone would suggest MADV_DONTNEED should be ignored
> if enabled=never. MADV_COLLAPSE just happens to be implemented on top of the THP
> logic. But it's a different feature in my view.

No I absolutely take your point, and indeed this is very reductive, but I think
that's a product of this interface being... sub-optimal.

if you dig into the docs for instance it's explicit about that referring to
MADV_[NO]HUGEPAGE.

But, as a user/sys-admin, I'd definitely find that surprising.

I think the intent of 'never' people is 'THP bad I don't want it' for whatever
reason that might be the case.

>
> >>
> >> I'd be rather surprised if I hadn't set madvise and a user uses madvise to
> >> in some fashion override the never.
> >>
> >> I mean I think we all agree this interface is to use a technical term -
> >> crap - and we need something a lot more fine-grained and smart,
>
> Yes agreed there!
>
> >> but I think
> >> given the situation we're in we should make it at least as least surprising
> >> as possible.
>
> >
> > Yes. If you configure "never" you are supposed to suffer, consistently.
> >
>
> OK fair enough. Just giving my 2 cents.
>
>

Your input is very welcome! We have made a mess here so it's good to talk it
through.