[RFC PATCH 0/5] add mTHP support for anonymous share pages

Baolin Wang posted 5 patches 1 year, 9 months ago
include/linux/huge_mm.h |   4 +-
mm/huge_memory.c        |   8 ++-
mm/memory.c             |  25 +++++++---
mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
4 files changed, 108 insertions(+), 36 deletions(-)
[RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Baolin Wang 1 year, 9 months ago
Anonymous pages have already been supported for multi-size (mTHP) allocation
through commit 19eaf44954df, that can allow THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, the anonymous shared pages will ignore the anonymous mTHP rule
configured through the sysfs interface, and can only use the PMD-mapped
THP, that is not reasonable. Many implement anonymous page sharing through
mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
therefore, users expect to apply an unified mTHP strategy for anonymous pages,
also including the anonymous shared pages, in order to enjoy the benefits of
mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.

The primary strategy is that, the use of huge pages for anonymous shared pages
still follows the global control determined by the mount option "huge=" parameter
or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
is checked to determine the mTHP size that can be used for large folio allocation
for these anonymous shared pages.

TODO:
 - More testing and provide some performance data.
 - Need more discussion about the large folio allocation strategy for a 'regular
file' operation created by memfd_create(), for example using ftruncate(fd) to specify
the 'file' size, which need to follow the anonymous mTHP rule too?
 - Do not split the large folio when share memory swap out.
 - Can swap in a large folio for share memory.

Baolin Wang (5):
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add mTHP support for anonymous share pages
  mm: shmem: add anonymous share mTHP counters

 include/linux/huge_mm.h |   4 +-
 mm/huge_memory.c        |   8 ++-
 mm/memory.c             |  25 +++++++---
 mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
 4 files changed, 108 insertions(+), 36 deletions(-)

-- 
2.39.3
Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Ryan Roberts 1 year, 9 months ago
On 22/04/2024 08:02, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shared pages will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Many implement anonymous page sharing through
> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> also including the anonymous shared pages, in order to enjoy the benefits of
> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.

This sounds like a very useful addition!

Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
for those workloads) that predominantly use shared anon memory?

> 
> The primary strategy is that, the use of huge pages for anonymous shared pages
> still follows the global control determined by the mount option "huge=" parameter
> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
> Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
> is checked to determine the mTHP size that can be used for large folio allocation
> for these anonymous shared pages.

I'm not sure about this proposed control mechanism; won't it break
compatibility? I could be wrong, but I don't think shmem's use of THP used to
depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
doesn't make sense to me that we now depend upon the
/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
default disables all sizes except 2M, which is set to "inherit" from
/sys/kernel/mm/transparent_hugepage/enabled).

The other problem is that shmem_enabled has a different set of options
(always/never/within_size/advise/deny/force) to enabled (always/madvise/never)

Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
plus the additional "inherit" option. By default all sizes will be set to
"never" except 2M, which is set to "inherit".

Of course the huge= mount option would also need to take a per-size option in
this case. e.g. huge=2048kB:advise,64kB:always

> 
> TODO:
>  - More testing and provide some performance data.
>  - Need more discussion about the large folio allocation strategy for a 'regular
> file' operation created by memfd_create(), for example using ftruncate(fd) to specify
> the 'file' size, which need to follow the anonymous mTHP rule too?
>  - Do not split the large folio when share memory swap out.
>  - Can swap in a large folio for share memory.
> 
> Baolin Wang (5):
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add mTHP support for anonymous share pages
>   mm: shmem: add anonymous share mTHP counters
> 
>  include/linux/huge_mm.h |   4 +-
>  mm/huge_memory.c        |   8 ++-
>  mm/memory.c             |  25 +++++++---
>  mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>  4 files changed, 108 insertions(+), 36 deletions(-)
>
Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Baolin Wang 1 year, 9 months ago

On 2024/4/23 18:41, Ryan Roberts wrote:
> On 22/04/2024 08:02, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> This sounds like a very useful addition!
> 
> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
> for those workloads) that predominantly use shared anon memory?

As far as I know, some database related workloads make extensive use of 
shared anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL 
likely also uses shared anonymous memory. And I still need to do some 
investigation to measure the performance.

[1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL

>> The primary strategy is that, the use of huge pages for anonymous shared pages
>> still follows the global control determined by the mount option "huge=" parameter
>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
>> Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>> is checked to determine the mTHP size that can be used for large folio allocation
>> for these anonymous shared pages.
> 
> I'm not sure about this proposed control mechanism; won't it break
> compatibility? I could be wrong, but I don't think shmem's use of THP used to
> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it

Yes, I realized this after more testing.

> doesn't make sense to me that we now depend upon the
> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
> default disables all sizes except 2M, which is set to "inherit" from
> /sys/kernel/mm/transparent_hugepage/enabled).
> 
> The other problem is that shmem_enabled has a different set of options
> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
> 
> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
> plus the additional "inherit" option. By default all sizes will be set to
> "never" except 2M, which is set to "inherit".

Sounds good to me. But I do not want to copy all same values from 
top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
always within_size advise never deny force

For mTHP's shmem_enabled interface, we can just keep below values:
always within_size advise never

Cause when checking if mTHP can be used for anon shmem, 'deny' is equal 
to 'never', and 'force' is equal to 'always'.

> Of course the huge= mount option would also need to take a per-size option in
> this case. e.g. huge=2048kB:advise,64kB:always

IMO, I do not want to change the global 'huge=' mount option, which can 
control both anon shmem and tmpfs, but mTHP now is only applied for anon 
shmem. So let's keep it be same with the global sysfs interface: 
/sys/kernel/mm/transparent_hugepage/shmem_enabled.

For tmpfs large folio strategy, I plan to address it later, and we may 
need more discussion to determine if it should follow the file large 
folio strategy or not (no investigation now).

Thanks for reviewing.

>> TODO:
>>   - More testing and provide some performance data.
>>   - Need more discussion about the large folio allocation strategy for a 'regular
>> file' operation created by memfd_create(), for example using ftruncate(fd) to specify
>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>   - Do not split the large folio when share memory swap out.
>>   - Can swap in a large folio for share memory.
>>
>> Baolin Wang (5):
>>    mm: memory: extend finish_fault() to support large folio
>>    mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>    mm: shmem: add THP validation for PMD-mapped THP related statistics
>>    mm: shmem: add mTHP support for anonymous share pages
>>    mm: shmem: add anonymous share mTHP counters
>>
>>   include/linux/huge_mm.h |   4 +-
>>   mm/huge_memory.c        |   8 ++-
>>   mm/memory.c             |  25 +++++++---
>>   mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>   4 files changed, 108 insertions(+), 36 deletions(-)
>>
Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Ryan Roberts 1 year, 9 months ago
On 24/04/2024 07:55, Baolin Wang wrote:
> 
> 
> On 2024/4/23 18:41, Ryan Roberts wrote:
>> On 22/04/2024 08:02, Baolin Wang wrote:
>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>> sysfs interface located at
>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>
>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>> configured through the sysfs interface, and can only use the PMD-mapped
>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> This sounds like a very useful addition!
>>
>> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
>> for those workloads) that predominantly use shared anon memory?
> 
> As far as I know, some database related workloads make extensive use of shared
> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
> uses shared anonymous memory. And I still need to do some investigation to
> measure the performance.
> 
> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL

Thanks for the pointer!

> 
>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>> still follows the global control determined by the mount option "huge="
>>> parameter
>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>> enabled.
>>> Subsequently, the mTHP sysfs interface
>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>> is checked to determine the mTHP size that can be used for large folio
>>> allocation
>>> for these anonymous shared pages.
>>
>> I'm not sure about this proposed control mechanism; won't it break
>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
> 
> Yes, I realized this after more testing.
> 
>> doesn't make sense to me that we now depend upon the
>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>> default disables all sizes except 2M, which is set to "inherit" from
>> /sys/kernel/mm/transparent_hugepage/enabled).
>>
>> The other problem is that shmem_enabled has a different set of options
>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>
>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>> plus the additional "inherit" option. By default all sizes will be set to
>> "never" except 2M, which is set to "inherit".
> 
> Sounds good to me. But I do not want to copy all same values from top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
> always within_size advise never deny force
> 
> For mTHP's shmem_enabled interface, we can just keep below values:
> always within_size advise never
> 
> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
> 'never', and 'force' is equal to 'always'.

I'll admit it wasn't completely clear to me after reading the docs, but my rough
understanding is:

 - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
   mmap(SHARED|ANON) allocations (mostly; see rule 3)
 - huge=... controls tmpfs allocations
 - deny and force in shmem_enabled are equivalent to never and always for
   mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
   if they were mounted with huge=never or huge=always

Is that correct? If so, then I think it still makes sense to support per-size
deny/force. Certainly if a per-size control is set to "inherit" and the
top-level control is set to deny or force, you would need that to mean something.

> 
>> Of course the huge= mount option would also need to take a per-size option in
>> this case. e.g. huge=2048kB:advise,64kB:always
> 
> IMO, I do not want to change the global 'huge=' mount option, which can control
> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's

How does huge= control anon shmem? I thought it was only for mounted
filesystems; so tmpfs? Perhaps my mental model for how this works is broken...

> keep it be same with the global sysfs interface:
> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
> 
> For tmpfs large folio strategy, I plan to address it later, and we may need more
> discussion to determine if it should follow the file large folio strategy or not
> (no investigation now).

OK. But until you get to tmpfs, you'll need an interim definition for what it
means if a per-size control is set to "inherit" and the top-level control is set
to deny/force.

> 
> Thanks for reviewing.

No problem! Thanks for doing the work!

> 
>>> TODO:
>>>   - More testing and provide some performance data.
>>>   - Need more discussion about the large folio allocation strategy for a
>>> 'regular
>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>> specify
>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>   - Do not split the large folio when share memory swap out.
>>>   - Can swap in a large folio for share memory.
>>>
>>> Baolin Wang (5):
>>>    mm: memory: extend finish_fault() to support large folio
>>>    mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>    mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>    mm: shmem: add mTHP support for anonymous share pages
>>>    mm: shmem: add anonymous share mTHP counters
>>>
>>>   include/linux/huge_mm.h |   4 +-
>>>   mm/huge_memory.c        |   8 ++-
>>>   mm/memory.c             |  25 +++++++---
>>>   mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>   4 files changed, 108 insertions(+), 36 deletions(-)
>>>

Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Baolin Wang 1 year, 9 months ago

On 2024/4/24 16:26, Ryan Roberts wrote:
> On 24/04/2024 07:55, Baolin Wang wrote:
>>
>>
>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>> sysfs interface located at
>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>
>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>
>>> This sounds like a very useful addition!
>>>
>>> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
>>> for those workloads) that predominantly use shared anon memory?
>>
>> As far as I know, some database related workloads make extensive use of shared
>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>> uses shared anonymous memory. And I still need to do some investigation to
>> measure the performance.
>>
>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
> 
> Thanks for the pointer!
> 
>>
>>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>>> still follows the global control determined by the mount option "huge="
>>>> parameter
>>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>> enabled.
>>>> Subsequently, the mTHP sysfs interface
>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>> is checked to determine the mTHP size that can be used for large folio
>>>> allocation
>>>> for these anonymous shared pages.
>>>
>>> I'm not sure about this proposed control mechanism; won't it break
>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>
>> Yes, I realized this after more testing.
>>
>>> doesn't make sense to me that we now depend upon the
>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>> default disables all sizes except 2M, which is set to "inherit" from
>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>
>>> The other problem is that shmem_enabled has a different set of options
>>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>>
>>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>> plus the additional "inherit" option. By default all sizes will be set to
>>> "never" except 2M, which is set to "inherit".
>>
>> Sounds good to me. But I do not want to copy all same values from top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>> always within_size advise never deny force
>>
>> For mTHP's shmem_enabled interface, we can just keep below values:
>> always within_size advise never
>>
>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>> 'never', and 'force' is equal to 'always'.
> 
> I'll admit it wasn't completely clear to me after reading the docs, but my rough
> understanding is:
> 
>   - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>     mmap(SHARED|ANON) allocations (mostly; see rule 3)
>   - huge=... controls tmpfs allocations
>   - deny and force in shmem_enabled are equivalent to never and always for
>     mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
>     if they were mounted with huge=never or huge=always
> 
> Is that correct? If so, then I think it still makes sense to support per-size

Correct.

> deny/force. Certainly if a per-size control is set to "inherit" and the
> top-level control is set to deny or force, you would need that to mean something.

IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' 
interface should only control the anonymous shmem. And 'huge=' controls 
tmpfs allocation, so we should not use anonymous control to override 
tmpfs control, which seems a little mess?

>>> Of course the huge= mount option would also need to take a per-size option in
>>> this case. e.g. huge=2048kB:advise,64kB:always
>>
>> IMO, I do not want to change the global 'huge=' mount option, which can control
>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's
> 
> How does huge= control anon shmem? I thought it was only for mounted
> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...

Sorry for noise, you are right. So this is still the reason I don't want 
to change the semantics of 'huge=', which is used to control tmpfs.

>> keep it be same with the global sysfs interface:
>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>
>> For tmpfs large folio strategy, I plan to address it later, and we may need more
>> discussion to determine if it should follow the file large folio strategy or not
>> (no investigation now).
> 
> OK. But until you get to tmpfs, you'll need an interim definition for what it
> means if a per-size control is set to "inherit" and the top-level control is set
> to deny/force.
> 
>>
>> Thanks for reviewing.
> 
> No problem! Thanks for doing the work!
> 
>>
>>>> TODO:
>>>>    - More testing and provide some performance data.
>>>>    - Need more discussion about the large folio allocation strategy for a
>>>> 'regular
>>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>>> specify
>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>    - Do not split the large folio when share memory swap out.
>>>>    - Can swap in a large folio for share memory.
>>>>
>>>> Baolin Wang (5):
>>>>     mm: memory: extend finish_fault() to support large folio
>>>>     mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>     mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>     mm: shmem: add mTHP support for anonymous share pages
>>>>     mm: shmem: add anonymous share mTHP counters
>>>>
>>>>    include/linux/huge_mm.h |   4 +-
>>>>    mm/huge_memory.c        |   8 ++-
>>>>    mm/memory.c             |  25 +++++++---
>>>>    mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>    4 files changed, 108 insertions(+), 36 deletions(-)
>>>>
Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Ryan Roberts 1 year, 9 months ago
On 24/04/2024 10:55, Baolin Wang wrote:
> 
> 
> On 2024/4/24 16:26, Ryan Roberts wrote:
>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>> sysfs interface located at
>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>
>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>> etc.
>>>>
>>>> This sounds like a very useful addition!
>>>>
>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>> benchmarks
>>>> for those workloads) that predominantly use shared anon memory?
>>>
>>> As far as I know, some database related workloads make extensive use of shared
>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>> uses shared anonymous memory. And I still need to do some investigation to
>>> measure the performance.
>>>
>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>
>> Thanks for the pointer!
>>
>>>
>>>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>>>> still follows the global control determined by the mount option "huge="
>>>>> parameter
>>>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>> enabled.
>>>>> Subsequently, the mTHP sysfs interface
>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>> allocation
>>>>> for these anonymous shared pages.
>>>>
>>>> I'm not sure about this proposed control mechanism; won't it break
>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>
>>> Yes, I realized this after more testing.
>>>
>>>> doesn't make sense to me that we now depend upon the
>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>
>>>> The other problem is that shmem_enabled has a different set of options
>>>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>>>
>>>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>> "never" except 2M, which is set to "inherit".
>>>
>>> Sounds good to me. But I do not want to copy all same values from top-level
>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>> always within_size advise never deny force
>>>
>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>> always within_size advise never
>>>
>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>> 'never', and 'force' is equal to 'always'.
>>
>> I'll admit it wasn't completely clear to me after reading the docs, but my rough
>> understanding is:
>>
>>   - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>     mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>   - huge=... controls tmpfs allocations
>>   - deny and force in shmem_enabled are equivalent to never and always for
>>     mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
>>     if they were mounted with huge=never or huge=always
>>
>> Is that correct? If so, then I think it still makes sense to support per-size
> 
> Correct.
> 
>> deny/force. Certainly if a per-size control is set to "inherit" and the
>> top-level control is set to deny or force, you would need that to mean something.
> 
> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
> so we should not use anonymous control to override tmpfs control, which seems a
> little mess?

I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
tmpfs stuff for another time. But my point is that
/mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
value is deny or force. So if you have:

echo deny > /mm/transparent_hugepage/shmem_enabled
echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled

What does that mean?

> 
>>>> Of course the huge= mount option would also need to take a per-size option in
>>>> this case. e.g. huge=2048kB:advise,64kB:always
>>>
>>> IMO, I do not want to change the global 'huge=' mount option, which can control
>>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's
>>
>> How does huge= control anon shmem? I thought it was only for mounted
>> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...
> 
> Sorry for noise, you are right. So this is still the reason I don't want to
> change the semantics of 'huge=', which is used to control tmpfs.
> 
>>> keep it be same with the global sysfs interface:
>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>>
>>> For tmpfs large folio strategy, I plan to address it later, and we may need more
>>> discussion to determine if it should follow the file large folio strategy or not
>>> (no investigation now).
>>
>> OK. But until you get to tmpfs, you'll need an interim definition for what it
>> means if a per-size control is set to "inherit" and the top-level control is set
>> to deny/force.
>>
>>>
>>> Thanks for reviewing.
>>
>> No problem! Thanks for doing the work!
>>
>>>
>>>>> TODO:
>>>>>    - More testing and provide some performance data.
>>>>>    - Need more discussion about the large folio allocation strategy for a
>>>>> 'regular
>>>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>>>> specify
>>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>>    - Do not split the large folio when share memory swap out.
>>>>>    - Can swap in a large folio for share memory.
>>>>>
>>>>> Baolin Wang (5):
>>>>>     mm: memory: extend finish_fault() to support large folio
>>>>>     mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>>     mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>>     mm: shmem: add mTHP support for anonymous share pages
>>>>>     mm: shmem: add anonymous share mTHP counters
>>>>>
>>>>>    include/linux/huge_mm.h |   4 +-
>>>>>    mm/huge_memory.c        |   8 ++-
>>>>>    mm/memory.c             |  25 +++++++---
>>>>>    mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>>    4 files changed, 108 insertions(+), 36 deletions(-)
>>>>>

Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Baolin Wang 1 year, 9 months ago

On 2024/4/24 18:01, Ryan Roberts wrote:
> On 24/04/2024 10:55, Baolin Wang wrote:
>>
>>
>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>>> sysfs interface located at
>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>
>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>>> etc.
>>>>>
>>>>> This sounds like a very useful addition!
>>>>>
>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>> benchmarks
>>>>> for those workloads) that predominantly use shared anon memory?
>>>>
>>>> As far as I know, some database related workloads make extensive use of shared
>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>> measure the performance.
>>>>
>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>
>>> Thanks for the pointer!
>>>
>>>>
>>>>>> The primary strategy is that, the use of huge pages for anonymous shared pages
>>>>>> still follows the global control determined by the mount option "huge="
>>>>>> parameter
>>>>>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>> enabled.
>>>>>> Subsequently, the mTHP sysfs interface
>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>> allocation
>>>>>> for these anonymous shared pages.
>>>>>
>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>
>>>> Yes, I realized this after more testing.
>>>>
>>>>> doesn't make sense to me that we now depend upon the
>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>
>>>>> The other problem is that shmem_enabled has a different set of options
>>>>> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
>>>>>
>>>>> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>>> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>> "never" except 2M, which is set to "inherit".
>>>>
>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>> always within_size advise never deny force
>>>>
>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>> always within_size advise never
>>>>
>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>> 'never', and 'force' is equal to 'always'.
>>>
>>> I'll admit it wasn't completely clear to me after reading the docs, but my rough
>>> understanding is:
>>>
>>>    - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>      mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>    - huge=... controls tmpfs allocations
>>>    - deny and force in shmem_enabled are equivalent to never and always for
>>>      mmap(SHARED|ANON) but additionally override all tmpfs mounts so they act as
>>>      if they were mounted with huge=never or huge=always
>>>
>>> Is that correct? If so, then I think it still makes sense to support per-size
>>
>> Correct.
>>
>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>> top-level control is set to deny or force, you would need that to mean something.
>>
>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
>> so we should not use anonymous control to override tmpfs control, which seems a
>> little mess?
> 
> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
> tmpfs stuff for another time. But my point is that
> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
> value is deny or force. So if you have:
> 
> echo deny > /mm/transparent_hugepage/shmem_enabled

IIUC, this global control will cause shmem_is_huge() to always return 
false, so no matter how 
'/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set, 
anonymous shmem will not use mTHP. No?

> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
> 
> What does that mean?
> 
>>
>>>>> Of course the huge= mount option would also need to take a per-size option in
>>>>> this case. e.g. huge=2048kB:advise,64kB:always
>>>>
>>>> IMO, I do not want to change the global 'huge=' mount option, which can control
>>>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So let's
>>>
>>> How does huge= control anon shmem? I thought it was only for mounted
>>> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...
>>
>> Sorry for noise, you are right. So this is still the reason I don't want to
>> change the semantics of 'huge=', which is used to control tmpfs.
>>
>>>> keep it be same with the global sysfs interface:
>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>>>
>>>> For tmpfs large folio strategy, I plan to address it later, and we may need more
>>>> discussion to determine if it should follow the file large folio strategy or not
>>>> (no investigation now).
>>>
>>> OK. But until you get to tmpfs, you'll need an interim definition for what it
>>> means if a per-size control is set to "inherit" and the top-level control is set
>>> to deny/force.
>>>
>>>>
>>>> Thanks for reviewing.
>>>
>>> No problem! Thanks for doing the work!
>>>
>>>>
>>>>>> TODO:
>>>>>>     - More testing and provide some performance data.
>>>>>>     - Need more discussion about the large folio allocation strategy for a
>>>>>> 'regular
>>>>>> file' operation created by memfd_create(), for example using ftruncate(fd) to
>>>>>> specify
>>>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>>>     - Do not split the large folio when share memory swap out.
>>>>>>     - Can swap in a large folio for share memory.
>>>>>>
>>>>>> Baolin Wang (5):
>>>>>>      mm: memory: extend finish_fault() to support large folio
>>>>>>      mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>>>      mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>>>      mm: shmem: add mTHP support for anonymous share pages
>>>>>>      mm: shmem: add anonymous share mTHP counters
>>>>>>
>>>>>>     include/linux/huge_mm.h |   4 +-
>>>>>>     mm/huge_memory.c        |   8 ++-
>>>>>>     mm/memory.c             |  25 +++++++---
>>>>>>     mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>>>     4 files changed, 108 insertions(+), 36 deletions(-)
>>>>>>
Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Ryan Roberts 1 year, 9 months ago
On 24/04/2024 14:49, Baolin Wang wrote:
> 
> 
> On 2024/4/24 18:01, Ryan Roberts wrote:
>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>>>> sysfs interface located at
>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>
>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>> pages,
>>>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>>>> etc.
>>>>>>
>>>>>> This sounds like a very useful addition!
>>>>>>
>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>> benchmarks
>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>
>>>>> As far as I know, some database related workloads make extensive use of shared
>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>> measure the performance.
>>>>>
>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>
>>>> Thanks for the pointer!
>>>>
>>>>>
>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>> pages
>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>> parameter
>>>>>>> or the sysfs interface at
>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>> enabled.
>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>> allocation
>>>>>>> for these anonymous shared pages.
>>>>>>
>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>
>>>>> Yes, I realized this after more testing.
>>>>>
>>>>>> doesn't make sense to me that we now depend upon the
>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>
>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>> (always/madvise/never)
>>>>>>
>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>> Introduce
>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>>>> same values as the top-level
>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>> "never" except 2M, which is set to "inherit".
>>>>>
>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>> always within_size advise never deny force
>>>>>
>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>> always within_size advise never
>>>>>
>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>> 'never', and 'force' is equal to 'always'.
>>>>
>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>> rough
>>>> understanding is:
>>>>
>>>>    - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>      mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>    - huge=... controls tmpfs allocations
>>>>    - deny and force in shmem_enabled are equivalent to never and always for
>>>>      mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>> act as
>>>>      if they were mounted with huge=never or huge=always
>>>>
>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>
>>> Correct.
>>>
>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>> top-level control is set to deny or force, you would need that to mean
>>>> something.
>>>
>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
>>> so we should not use anonymous control to override tmpfs control, which seems a
>>> little mess?
>>
>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
>> tmpfs stuff for another time. But my point is that
>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>> value is deny or force. So if you have:
>>
>> echo deny > /mm/transparent_hugepage/shmem_enabled
> 
> IIUC, this global control will cause shmem_is_huge() to always return false, so
> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
> anonymous shmem will not use mTHP. No?

No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
the established pattern.

For anon-private, each size is controlled by its
/mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
"inherit", in which case the value in /mm/transparent_hugepage/enabled is used
for that size.

That approach enables us to 1) maintain back-compat and 2) control each size
independently

1) is met because the default is that all sizes are initially set to "never",
except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
which is initially set to inherit. So any mTHP unaware SW can still modify
/mm/transparent_hugepage/enabled and it will still only apply to PMD size.

2) is met because mTHP aware SW can come along and e.g. enable the 64K size
(echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
modify the value in /mm/transparent_hugepage/enabled.

> 
>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>
>> What does that mean?

So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
support the deny and force values. When applied to non-PMD sizes, "deny" can
just be a noop for now, because there was no way to configure a tmpfs mount for
non-PMD size THP in the first place. But I'm not sure what to do with "force"?

>>
>>>
>>>>>> Of course the huge= mount option would also need to take a per-size option in
>>>>>> this case. e.g. huge=2048kB:advise,64kB:always
>>>>>
>>>>> IMO, I do not want to change the global 'huge=' mount option, which can
>>>>> control
>>>>> both anon shmem and tmpfs, but mTHP now is only applied for anon shmem. So
>>>>> let's
>>>>
>>>> How does huge= control anon shmem? I thought it was only for mounted
>>>> filesystems; so tmpfs? Perhaps my mental model for how this works is broken...
>>>
>>> Sorry for noise, you are right. So this is still the reason I don't want to
>>> change the semantics of 'huge=', which is used to control tmpfs.
>>>
>>>>> keep it be same with the global sysfs interface:
>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled.
>>>>>
>>>>> For tmpfs large folio strategy, I plan to address it later, and we may need
>>>>> more
>>>>> discussion to determine if it should follow the file large folio strategy
>>>>> or not
>>>>> (no investigation now).
>>>>
>>>> OK. But until you get to tmpfs, you'll need an interim definition for what it
>>>> means if a per-size control is set to "inherit" and the top-level control is
>>>> set
>>>> to deny/force.
>>>>
>>>>>
>>>>> Thanks for reviewing.
>>>>
>>>> No problem! Thanks for doing the work!
>>>>
>>>>>
>>>>>>> TODO:
>>>>>>>     - More testing and provide some performance data.
>>>>>>>     - Need more discussion about the large folio allocation strategy for a
>>>>>>> 'regular
>>>>>>> file' operation created by memfd_create(), for example using
>>>>>>> ftruncate(fd) to
>>>>>>> specify
>>>>>>> the 'file' size, which need to follow the anonymous mTHP rule too?
>>>>>>>     - Do not split the large folio when share memory swap out.
>>>>>>>     - Can swap in a large folio for share memory.
>>>>>>>
>>>>>>> Baolin Wang (5):
>>>>>>>      mm: memory: extend finish_fault() to support large folio
>>>>>>>      mm: shmem: add an 'order' parameter for shmem_alloc_hugefolio()
>>>>>>>      mm: shmem: add THP validation for PMD-mapped THP related statistics
>>>>>>>      mm: shmem: add mTHP support for anonymous share pages
>>>>>>>      mm: shmem: add anonymous share mTHP counters
>>>>>>>
>>>>>>>     include/linux/huge_mm.h |   4 +-
>>>>>>>     mm/huge_memory.c        |   8 ++-
>>>>>>>     mm/memory.c             |  25 +++++++---
>>>>>>>     mm/shmem.c              | 107 ++++++++++++++++++++++++++++++----------
>>>>>>>     4 files changed, 108 insertions(+), 36 deletions(-)
>>>>>>>

Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Baolin Wang 1 year, 9 months ago

On 2024/4/24 22:20, Ryan Roberts wrote:
> On 24/04/2024 14:49, Baolin Wang wrote:
>>
>>
>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>>>>>>>> through commit 19eaf44954df, that can allow THP to be configured through the
>>>>>>>> sysfs interface located at
>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>
>>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>>> pages,
>>>>>>>> also including the anonymous shared pages, in order to enjoy the benefits of
>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss
>>>>>>>> etc.
>>>>>>>
>>>>>>> This sounds like a very useful addition!
>>>>>>>
>>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>>> benchmarks
>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>
>>>>>> As far as I know, some database related workloads make extensive use of shared
>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely also
>>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>>> measure the performance.
>>>>>>
>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>
>>>>> Thanks for the pointer!
>>>>>
>>>>>>
>>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>>> pages
>>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>>> parameter
>>>>>>>> or the sysfs interface at
>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>>> enabled.
>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>>> allocation
>>>>>>>> for these anonymous shared pages.
>>>>>>>
>>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP used to
>>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>>
>>>>>> Yes, I realized this after more testing.
>>>>>>
>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>
>>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>> (always/madvise/never)
>>>>>>>
>>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>>> Introduce
>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
>>>>>>> same values as the top-level
>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>
>>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>> always within_size advise never deny force
>>>>>>
>>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>>> always within_size advise never
>>>>>>
>>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>
>>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>>> rough
>>>>> understanding is:
>>>>>
>>>>>     - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>>       mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>     - huge=... controls tmpfs allocations
>>>>>     - deny and force in shmem_enabled are equivalent to never and always for
>>>>>       mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>>> act as
>>>>>       if they were mounted with huge=never or huge=always
>>>>>
>>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>>
>>>> Correct.
>>>>
>>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>>> top-level control is set to deny or force, you would need that to mean
>>>>> something.
>>>>
>>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>> should only control the anonymous shmem. And 'huge=' controls tmpfs allocation,
>>>> so we should not use anonymous control to override tmpfs control, which seems a
>>>> little mess?
>>>
>>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave the
>>> tmpfs stuff for another time. But my point is that
>>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>>> value is deny or force. So if you have:
>>>
>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>
>> IIUC, this global control will cause shmem_is_huge() to always return false, so
>> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>> anonymous shmem will not use mTHP. No?
> 
> No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
> the established pattern.
> 
> For anon-private, each size is controlled by its
> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
> "inherit", in which case the value in /mm/transparent_hugepage/enabled is used
> for that size.
> 
> That approach enables us to 1) maintain back-compat and 2) control each size
> independently
> 
> 1) is met because the default is that all sizes are initially set to "never",
> except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
> which is initially set to inherit. So any mTHP unaware SW can still modify
> /mm/transparent_hugepage/enabled and it will still only apply to PMD size.
> 
> 2) is met because mTHP aware SW can come along and e.g. enable the 64K size
> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
> modify the value in /mm/transparent_hugepage/enabled.

Thanks for explanation. Initially, I want to make 
‘/mm/transparent_hugepage/shmem_enabled’ be a global control for huge 
page, but I think it should follow the same strategy as anon mTHP as you 
said.

>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>
>>> What does that mean?
> 
> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
> support the deny and force values. When applied to non-PMD sizes, "deny" can
> just be a noop for now, because there was no way to configure a tmpfs mount for
> non-PMD size THP in the first place. But I'm not sure what to do with "force"?

OK. And I also prefer that "force" should be a noop too, since anon 
shmem control should not configure tmpfs huge page allocation.
Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by Ryan Roberts 1 year, 9 months ago
On 25/04/2024 07:20, Baolin Wang wrote:
> 
> 
> On 2024/4/24 22:20, Ryan Roberts wrote:
>> On 24/04/2024 14:49, Baolin Wang wrote:
>>>
>>>
>>> On 2024/4/24 18:01, Ryan Roberts wrote:
>>>> On 24/04/2024 10:55, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/24 16:26, Ryan Roberts wrote:
>>>>>> On 24/04/2024 07:55, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/23 18:41, Ryan Roberts wrote:
>>>>>>>> On 22/04/2024 08:02, Baolin Wang wrote:
>>>>>>>>> Anonymous pages have already been supported for multi-size (mTHP)
>>>>>>>>> allocation
>>>>>>>>> through commit 19eaf44954df, that can allow THP to be configured
>>>>>>>>> through the
>>>>>>>>> sysfs interface located at
>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>>>>>
>>>>>>>>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>>>>>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>>>>>> THP, that is not reasonable. Many implement anonymous page sharing through
>>>>>>>>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>>>>>>>>> therefore, users expect to apply an unified mTHP strategy for anonymous
>>>>>>>>> pages,
>>>>>>>>> also including the anonymous shared pages, in order to enjoy the
>>>>>>>>> benefits of
>>>>>>>>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>>>>>>>>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB
>>>>>>>>> miss
>>>>>>>>> etc.
>>>>>>>>
>>>>>>>> This sounds like a very useful addition!
>>>>>>>>
>>>>>>>> Out of interest, can you point me at any workloads (and off-the-shelf
>>>>>>>> benchmarks
>>>>>>>> for those workloads) that predominantly use shared anon memory?
>>>>>>>
>>>>>>> As far as I know, some database related workloads make extensive use of
>>>>>>> shared
>>>>>>> anonymous page, such as PolarDB[1] in our Alibaba fleet, or MySQL likely
>>>>>>> also
>>>>>>> uses shared anonymous memory. And I still need to do some investigation to
>>>>>>> measure the performance.
>>>>>>>
>>>>>>> [1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
>>>>>>
>>>>>> Thanks for the pointer!
>>>>>>
>>>>>>>
>>>>>>>>> The primary strategy is that, the use of huge pages for anonymous shared
>>>>>>>>> pages
>>>>>>>>> still follows the global control determined by the mount option "huge="
>>>>>>>>> parameter
>>>>>>>>> or the sysfs interface at
>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>> The utilization of mTHP is allowed only when the global 'huge' switch is
>>>>>>>>> enabled.
>>>>>>>>> Subsequently, the mTHP sysfs interface
>>>>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>>>>>>>>> is checked to determine the mTHP size that can be used for large folio
>>>>>>>>> allocation
>>>>>>>>> for these anonymous shared pages.
>>>>>>>>
>>>>>>>> I'm not sure about this proposed control mechanism; won't it break
>>>>>>>> compatibility? I could be wrong, but I don't think shmem's use of THP
>>>>>>>> used to
>>>>>>>> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
>>>>>>>
>>>>>>> Yes, I realized this after more testing.
>>>>>>>
>>>>>>>> doesn't make sense to me that we now depend upon the
>>>>>>>> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
>>>>>>>> default disables all sizes except 2M, which is set to "inherit" from
>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled).
>>>>>>>>
>>>>>>>> The other problem is that shmem_enabled has a different set of options
>>>>>>>> (always/never/within_size/advise/deny/force) to enabled
>>>>>>>> (always/madvise/never)
>>>>>>>>
>>>>>>>> Perhaps it would be cleaner to do the same trick we did for enabled;
>>>>>>>> Introduce
>>>>>>>> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all
>>>>>>>> the
>>>>>>>> same values as the top-level
>>>>>>>> /sys/kernel/mm/transparent_hugepage/shmem_enabled,
>>>>>>>> plus the additional "inherit" option. By default all sizes will be set to
>>>>>>>> "never" except 2M, which is set to "inherit".
>>>>>>>
>>>>>>> Sounds good to me. But I do not want to copy all same values from top-level
>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled':
>>>>>>> always within_size advise never deny force
>>>>>>>
>>>>>>> For mTHP's shmem_enabled interface, we can just keep below values:
>>>>>>> always within_size advise never
>>>>>>>
>>>>>>> Cause when checking if mTHP can be used for anon shmem, 'deny' is equal to
>>>>>>> 'never', and 'force' is equal to 'always'.
>>>>>>
>>>>>> I'll admit it wasn't completely clear to me after reading the docs, but my
>>>>>> rough
>>>>>> understanding is:
>>>>>>
>>>>>>     - /sys/kernel/mm/transparent_hugepage/shmem_enabled controls
>>>>>>       mmap(SHARED|ANON) allocations (mostly; see rule 3)
>>>>>>     - huge=... controls tmpfs allocations
>>>>>>     - deny and force in shmem_enabled are equivalent to never and always for
>>>>>>       mmap(SHARED|ANON) but additionally override all tmpfs mounts so they
>>>>>> act as
>>>>>>       if they were mounted with huge=never or huge=always
>>>>>>
>>>>>> Is that correct? If so, then I think it still makes sense to support per-size
>>>>>
>>>>> Correct.
>>>>>
>>>>>> deny/force. Certainly if a per-size control is set to "inherit" and the
>>>>>> top-level control is set to deny or force, you would need that to mean
>>>>>> something.
>>>>>
>>>>> IMHO, the '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
>>>>> should only control the anonymous shmem. And 'huge=' controls tmpfs
>>>>> allocation,
>>>>> so we should not use anonymous control to override tmpfs control, which
>>>>> seems a
>>>>> little mess?
>>>>
>>>> I agree it would be cleaner to only handle mmap(SHARED|ANON) here, and leave
>>>> the
>>>> tmpfs stuff for another time. But my point is that
>>>> /mm/transparent_hugepage/shmem_enabled already interferes with tmpfs if the
>>>> value is deny or force. So if you have:
>>>>
>>>> echo deny > /mm/transparent_hugepage/shmem_enabled
>>>
>>> IIUC, this global control will cause shmem_is_huge() to always return false, so
>>> no matter how '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
>>> anonymous shmem will not use mTHP. No?
>>
>> No, that's not how '/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
>> I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' should follow
>> the established pattern.
>>
>> For anon-private, each size is controlled by its
>> /mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that value is
>> "inherit", in which case the value in /mm/transparent_hugepage/enabled is used
>> for that size.
>>
>> That approach enables us to 1) maintain back-compat and 2) control each size
>> independently
>>
>> 1) is met because the default is that all sizes are initially set to "never",
>> except the PMD-size (e.g. /mm/transparent_hugepage/hugepage-2048kB/enabled)
>> which is initially set to inherit. So any mTHP unaware SW can still modify
>> /mm/transparent_hugepage/enabled and it will still only apply to PMD size.
>>
>> 2) is met because mTHP aware SW can come along and e.g. enable the 64K size
>> (echo always > /mm/transparent_hugepage/hugepage-64kB/enabled) without having to
>> modify the value in /mm/transparent_hugepage/enabled.
> 
> Thanks for explanation. Initially, I want to make
> ‘/mm/transparent_hugepage/shmem_enabled’ be a global control for huge page, but
> I think it should follow the same strategy as anon mTHP as you said.
> 
>>>> echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
>>>>
>>>> What does that mean?
>>
>> So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled will need to
>> support the deny and force values. When applied to non-PMD sizes, "deny" can
>> just be a noop for now, because there was no way to configure a tmpfs mount for
>> non-PMD size THP in the first place. But I'm not sure what to do with "force"?
> 
> OK. And I also prefer that "force" should be a noop too, since anon shmem
> control should not configure tmpfs huge page allocation.

I guess technically they won't be noops, but (for the non-PMD-sizes) "force"
will be an alias for "always" and "deny" will be an alias for "never"?

I was just a bit concerned about later changing that behavior to also impact
tmpfs once tmpfs supports mTHP; could that cause breaks? But thinking about it,
I don't see that as a problem.

Re: [RFC PATCH 0/5] add mTHP support for anonymous share pages
Posted by David Hildenbrand 1 year, 9 months ago
On 23.04.24 12:41, Ryan Roberts wrote:
> On 22/04/2024 08:02, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shared pages will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Many implement anonymous page sharing through
>> mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
>> therefore, users expect to apply an unified mTHP strategy for anonymous pages,
>> also including the anonymous shared pages, in order to enjoy the benefits of
>> mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
>> than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> This sounds like a very useful addition!
> 
> Out of interest, can you point me at any workloads (and off-the-shelf benchmarks
> for those workloads) that predominantly use shared anon memory?
> 
>>
>> The primary strategy is that, the use of huge pages for anonymous shared pages
>> still follows the global control determined by the mount option "huge=" parameter
>> or the sysfs interface at '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> The utilization of mTHP is allowed only when the global 'huge' switch is enabled.
>> Subsequently, the mTHP sysfs interface (/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
>> is checked to determine the mTHP size that can be used for large folio allocation
>> for these anonymous shared pages.
> 
> I'm not sure about this proposed control mechanism; won't it break
> compatibility? I could be wrong, but I don't think shmem's use of THP used to
> depend upon the value of /sys/kernel/mm/transparent_hugepage/enabled? So it
> doesn't make sense to me that we now depend upon the
> /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled values (which by
> default disables all sizes except 2M, which is set to "inherit" from
> /sys/kernel/mm/transparent_hugepage/enabled).
> 
> The other problem is that shmem_enabled has a different set of options
> (always/never/within_size/advise/deny/force) to enabled (always/madvise/never)
> 
> Perhaps it would be cleaner to do the same trick we did for enabled; Introduce
> /mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which can have all the
> same values as the top-level /sys/kernel/mm/transparent_hugepage/shmem_enabled,
> plus the additional "inherit" option. By default all sizes will be set to
> "never" except 2M, which is set to "inherit".

Matches what I had in mind.

-- 
Cheers,

David / dhildenb