[RFC PATCH 0/3] Cgroup-based THP control
Posted by gutierrez.asier@huawei-partners.com 3 weeks, 4 days ago
From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>

Currently, THP modes are set globally. This can be overkill if only a specific
app or set of apps needs to benefit from THP. Moreover, different apps might
need different THP settings. Here we propose a cgroup-based THP control
mechanism.

A THP interface is added to the memory cgroup subsystem. The existing global
THP control semantics are preserved for backward compatibility: when THP modes
are set globally, the changes are propagated to all memory cgroups. However,
when a particular cgroup changes its THP policy, the global THP policy in sysfs
remains the same.

Two new memcg files are exposed: memory.thp_enabled and memory.thp_defrag,
which use exactly the same format as the global THP enabled/defrag files.

Child cgroups inherit THP settings from their parent cgroup upon creation.
Mode changes in a particular cgroup are not propagated to its child cgroups.

During the memory cgroup attachment stage, the corresponding slots are added
to or removed from khugepaged according to the THP policy.

Usage examples:

Set the "madvise" mode globally:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

All the settings are propagated:
# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never

# cat /sys/fs/cgroup/test/memory.thp_enabled
always [madvise] never

Set "always" for a specific cgroup:
# echo always > /sys/fs/cgroup/test/memory.thp_enabled
# cat /sys/fs/cgroup/test/memory.thp_enabled
[always] madvise never

The root cgroup keeps the "madvise" mode:
# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never

When attempting to read the global setting, we get a "mixed state" warning,
since the THP mode is no longer the same for every cgroup:
# cat /sys/kernel/mm/transparent_hugepage/enabled
Mixed state: see particular memcg flags!

Set the THP mode globally again and make sure everything works as expected:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# cat /sys/fs/cgroup/memory.thp_enabled
always madvise [never]

# cat /sys/fs/cgroup/test/memory.thp_enabled
always madvise [never]

Here is a simple demo with a test that performs an anonymous mmap() followed
by a series of random reads. The system is rebooted between the cases.

Case 1: Global THP - always. No cgroup.

// Global THP stats:
AnonHugePages:    391168 kB
FileHugePages:    120832 kB
FilePmdMapped:     67584 kB

// THP stats from *smaps* of the testing process
AnonHugePages:     12288 kB

Case 2: Global THP - never. Cgroup - always.

// Global THP stats:
AnonHugePages:     12288 kB
FileHugePages:      2048 kB
FilePmdMapped:      2048 kB

// THP stats from *smaps* of the testing process
AnonHugePages:     12288 kB

// The cgroup THP stats
anon_thp 12582912
file_thp 2097152

Clearly, there is a huge difference between the two cases in terms of global
THP usage, showing that the cgroup approach is beneficial when a specific app
or set of apps needs THP but changing the application code is not an option.

TODO list:

1. Anonymous mTHP
2. Fine-grained mode selection for different VMA types ("anon|exec|ro|file"),
   to support combinations such as "always + exec", "always + anon", etc.
3. Per-cgroup limit for the THP usage


Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>

Asier Gutierrez, Anatoly Stepanov (3):
  mm: Add thp_flags control for cgroup
  mm: Support for huge pages in cgroups
  mm: Add thp_defrag control for cgroup


 include/linux/huge_mm.h    |  23 +++-
 include/linux/khugepaged.h |   2 +-
 include/linux/memcontrol.h |  28 ++++
 mm/huge_memory.c           | 207 ++++++++++++++++++-----------
 mm/khugepaged.c            |   8 +-
 mm/memcontrol.c            | 262 +++++++++++++++++++++++++++++++++++++
 6 files changed, 449 insertions(+), 81 deletions(-)

-- 
2.34.1
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Chris Down 3 weeks, 4 days ago
gutierrez.asier@huawei-partners.com writes:
>New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>have completely the same format as global THP enabled/defrag.

cgroup controls exist because there are things we want to do for an entire 
class of processes (group OOM, resource control, etc). Enabling or disabling 
some specific setting is generally not one of them, hence why we got rid of 
things like per-cgroup vm.swappiness. We know that these controls do not 
compose well and have caused a lot of pain in the past. So my immediate 
reaction is a nack on the general concept, unless there's some absolutely 
compelling case here.

I talked a little at Kernel Recipes last year about moving away from sysctl and 
other global interfaces and making things more granular. Don't get me wrong, I 
think that is a good thing (although, of course, a very large undertaking) -- 
but it is a mistake to overload the amount of controls we expose as part of the 
cgroup interface.

I am up for thinking overall about how we can improve the state of global 
tunables to make them more granular overall, but this can't set a precedent as 
the way to do it.
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 4 days ago
On Wed 30-10-24 14:45:24, Chris Down wrote:
> gutierrez.asier@huawei-partners.com writes:
> > New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> > have completely the same format as global THP enabled/defrag.
> 
> cgroup controls exist because there are things we want to do for an entire
> class of processes (group OOM, resource control, etc). Enabling or disabling
> some specific setting is generally not one of them, hence why we got rid of
> things like per-cgroup vm.swappiness. We know that these controls do not
> compose well and have caused a lot of pain in the past. So my immediate
> reaction is a nack on the general concept, unless there's some absolutely
> compelling case here.
> 
> I talked a little at Kernel Recipes last year about moving away from sysctl
> and other global interfaces and making things more granular. Don't get me
> wrong, I think that is a good thing (although, of course, a very large
> undertaking) -- but it is a mistake to overload the amount of controls we
> expose as part of the cgroup interface.

Completely agreed!

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Matthew Wilcox 3 weeks, 4 days ago
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> 
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.

Or maybe we should stop making the sysadmin's life so damned hard and
figure out how to do without all of these settings?
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by David Hildenbrand 3 weeks, 4 days ago
On 30.10.24 14:14, Matthew Wilcox wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
> 
> Or maybe we should stop making the sysadmin's life so damned hard and
> figure out how to do without all of these settings?

In particular if there is no proper problem description / use case.

-- 
Cheers,

David / dhildenb
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Johannes Weiner 3 weeks, 4 days ago
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> 
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
> 
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.
> 
> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
> 
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.

Cgroups are for hierarchical resource distribution. It's tempting to add 
parameters you would want for flat collections of processes, but it gets weird 
when it comes to inheritance and hierarchical semantics inside the cgroup tree 
- like it does here. So this is not a good fit.

On this particular issue, I agree with what Willy and David said: let's not
proliferate THP knobs; let's focus on making them truly transparent.
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 2 days ago
On 10/30/2024 6:08 PM, Johannes Weiner wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>>
>> THP interface is added to memory cgroup subsystem. Existing global THP control
>> semantics is supported for backward compatibility. When THP modes are set
>> globally all the changes are propagated to memory cgroups. However, when a
>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>> the same.
>>
>> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>> have completely the same format as global THP enabled/defrag.
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.

> 
> Cgroups are for hierarchical resource distribution. It's tempting to
> add parameters you would want for flat collections of processes, but
> it gets weird when it comes to inheritance and hiearchical semantics
> inside the cgroup tree - like it does here. So this is not a good fit.
> 
> On this particular issue, I agree with what Willy and David: let's not
> proliferate THP knobs; let's focus on making them truly transparent.

We're also thinking about a THP-limit direction (as mentioned in the cover
letter): have a per-cgroup THP limit and only the global THP knobs, plus a
couple of additional global modes (always-cgroup/madvise-cgroup).

"always-cgroup", for instance, would enable THP for those tasks
which are attached to a non-root memcg.

A per-cgroup THP limit could be used in combination with the global THP knobs,
and in this way we can maintain hierarchical semantics.

For now this is just an idea; maybe it's better to send another RFC patch set
for it, to allow a more productive conversation.


-- 
Anatoly Stepanov, Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 4 days ago
On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> 
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
> 
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.

Do you have any specific examples where this would be beneficial?

> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
> 
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.

So this breaks hierarchical property, doesn't it? In other words if a
parent cgroup would like to enforce a certain policy to all descendants
then this is not really possible. 
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Gutierrez Asier 3 weeks, 4 days ago

On 10/30/2024 11:38 AM, Michal Hocko wrote:
> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>>
>> THP interface is added to memory cgroup subsystem. Existing global THP control
>> semantics is supported for backward compatibility. When THP modes are set
>> globally all the changes are propagated to memory cgroups. However, when a
>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>> the same.
> 
> Do you have any specific examples where this would be benefitial?

We are currently focused mostly on database scenarios (MySQL, Redis).

The main idea is to avoid a global THP setting that can potentially waste
resources overall, and to have per-cgroup granularity instead.

Besides THP being beneficial for DB performance, we observe high THP
"over-usage" by some unrelated apps/services when "always" mode is enabled
globally.

With cgroup-based THP we can specify the exact "THP users", and we plan to
introduce the ability to limit the amount of THPs per cgroup.

We expect this to also be beneficial for some container-based workloads, where
certain containers can have different THP policies, but we haven't looked into
this case yet.

>> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>> have completely the same format as global THP enabled/defrag.
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.
> 
> So this breaks hierarchical property, doesn't it? In other words if a
> parent cgroup would like to enforce a certain policy to all descendants
> then this is not really possible. 

The first idea was to have some flexibility when changing THP policies. 

I will submit a new patch set which will enforce the cgroup hierarchy and change all
the children recursively.

-- 
Asier Gutierrez
Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 4 days ago
On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
> 
> 
> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> > On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> >>
> >> Currently THP modes are set globally. It can be an overkill if only some
> >> specific app/set of apps need to get benefits from THP usage. Moreover, various
> >> apps might need different THP settings. Here we propose a cgroup-based THP
> >> control mechanism.
> >>
> >> THP interface is added to memory cgroup subsystem. Existing global THP control
> >> semantics is supported for backward compatibility. When THP modes are set
> >> globally all the changes are propagated to memory cgroups. However, when a
> >> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> >> the same.
> > 
> > Do you have any specific examples where this would be benefitial?
> 
> Now we're mostly focused on database scenarios (MySQL, Redis).  

That seems to be more process-oriented than workload-oriented. Why doesn't the
existing per-process tuning work?

[...]
> >> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >> cgroup mode changes aren't propagated to child cgroups.
> > 
> > So this breaks hierarchical property, doesn't it? In other words if a
> > parent cgroup would like to enforce a certain policy to all descendants
> > then this is not really possible. 
> 
> The first idea was to have some flexibility when changing THP policies. 
> 
> I will submit a new patch set which will enforce the cgroup hierarchy and change all
> the children recursively.

What is the expected semantics then?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Gutierrez Asier 3 weeks, 4 days ago

On 10/30/2024 4:27 PM, Michal Hocko wrote:
> On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>>
>>
>> On 10/30/2024 11:38 AM, Michal Hocko wrote:
>>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>>>
>>>> Currently THP modes are set globally. It can be an overkill if only some
>>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>>>> apps might need different THP settings. Here we propose a cgroup-based THP
>>>> control mechanism.
>>>>
>>>> THP interface is added to memory cgroup subsystem. Existing global THP control
>>>> semantics is supported for backward compatibility. When THP modes are set
>>>> globally all the changes are propagated to memory cgroups. However, when a
>>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>>>> the same.
>>>
>>> Do you have any specific examples where this would be benefitial?
>>
>> Now we're mostly focused on database scenarios (MySQL, Redis).  
> 
> That seems to be more process than workload oriented. Why the existing
> per-process tuning doesn't work?
> 
> [...]

1st point

We're trying to provide a transparent mechanism, but all the existing
per-process methods (MADV_HUGEPAGE, MADV_COLLAPSE, hugetlbfs) require
modifying the app itself.

Moreover, we also use file-backed THPs (mostly for .text), which makes it even
more complicated for user-space developers.

>>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>>>> cgroup mode changes aren't propagated to child cgroups.
>>>
>>> So this breaks hierarchical property, doesn't it? In other words if a
>>> parent cgroup would like to enforce a certain policy to all descendants
>>> then this is not really possible. 
>>
>> The first idea was to have some flexibility when changing THP policies. 
>>
>> I will submit a new patch set which will enforce the cgroup hierarchy and change all
>> the children recursively.
> 
> What is the expected semantics then?

2nd point (on semantics)
1. Children inherit the THP policy upon creation
2. Parent's policy changes are propagated to all the children
3. Children can set the policy independently

-- 
Asier Gutierrez
Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 4 days ago
On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
> 
> 
> On 10/30/2024 4:27 PM, Michal Hocko wrote:
> > On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
> >>
> >>
> >> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> >>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> >>>>
> >>>> Currently THP modes are set globally. It can be an overkill if only some
> >>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
> >>>> apps might need different THP settings. Here we propose a cgroup-based THP
> >>>> control mechanism.
> >>>>
> >>>> THP interface is added to memory cgroup subsystem. Existing global THP control
> >>>> semantics is supported for backward compatibility. When THP modes are set
> >>>> globally all the changes are propagated to memory cgroups. However, when a
> >>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> >>>> the same.
> >>>
> >>> Do you have any specific examples where this would be benefitial?
> >>
> >> Now we're mostly focused on database scenarios (MySQL, Redis).  
> > 
> > That seems to be more process than workload oriented. Why the existing
> > per-process tuning doesn't work?
> > 
> > [...]
> 
> 1st Point
> 
> We're trying to provide a transparent mechanism, but all the existing per-process
> methods require to modify an app itself (MADV_HUGE, MADV_COLLAPSE, hugetlbfs)

There is also prctl to define a per-process policy. We currently have the
means to disable THP for a process to override the default behavior. That
would be mostly transparent for the application.

You have not really answered a more fundamental question though. Why should
the THP behavior be at the cgroup scope? From a practical POV that would
represent containers, which are a mixed bag of applications supporting the
workload. Why does the same THP policy apply to all of them? Doesn't this make
the sub-optimal global behavior the same at the cgroup level, where some parts
will benefit while others will not?

> Moreover we're using file-backed THPs too (for .text mostly), which make it for
> user-space developers even more complicated.
> 
> >>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >>>> cgroup mode changes aren't propagated to child cgroups.
> >>>
> >>> So this breaks hierarchical property, doesn't it? In other words if a
> >>> parent cgroup would like to enforce a certain policy to all descendants
> >>> then this is not really possible. 
> >>
> >> The first idea was to have some flexibility when changing THP policies. 
> >>
> >> I will submit a new patch set which will enforce the cgroup hierarchy and change all
> >> the children recursively.
> > 
> > What is the expected semantics then?
> 
> 2nd point (on semantics)
> 1. Children inherit the THP policy upon creation
> 2. Parent's policy changes are propagated to all the children
> 3. Children can set the policy independently

So if the parent decides that none of the children should be using THP, they
can override that, meaning the tuning at the parent has no imperative control.
This breaks the hierarchical property that is expected of cgroup control
files.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 3 days ago
On 10/30/2024 6:15 PM, Michal Hocko wrote:
> On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
>>
>>
>> On 10/30/2024 4:27 PM, Michal Hocko wrote:
>>> On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>>>>
>>>>
>>>> On 10/30/2024 11:38 AM, Michal Hocko wrote:
>>>>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>>>>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>>>>>
>>>>>> Currently THP modes are set globally. It can be an overkill if only some
>>>>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>>>>>> apps might need different THP settings. Here we propose a cgroup-based THP
>>>>>> control mechanism.
>>>>>>
>>>>>> THP interface is added to memory cgroup subsystem. Existing global THP control
>>>>>> semantics is supported for backward compatibility. When THP modes are set
>>>>>> globally all the changes are propagated to memory cgroups. However, when a
>>>>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>>>>>> the same.
>>>>>
>>>>> Do you have any specific examples where this would be benefitial?
>>>>
>>>> Now we're mostly focused on database scenarios (MySQL, Redis).  
>>>
>>> That seems to be more process than workload oriented. Why the existing
>>> per-process tuning doesn't work?
>>>
>>> [...]
>>
>> 1st Point
>>
>> We're trying to provide a transparent mechanism, but all the existing per-process
>> methods require to modify an app itself (MADV_HUGE, MADV_COLLAPSE, hugetlbfs)

>
> There is also prctl to define per-process policy. We currently have
> means to disable THP for the process to override the defeault behavior.
> That would be mostly transparent for the application. 
(Answering as a co-author of the feature)

Since prctl(PR_SET_THP_DISABLE) can only be used from the calling thread, it
needs app developer participation anyway. In theory, a kind of launcher
process could be used to exploit the inheritance of the corresponding prctl
THP setting, but this doesn't seem transparent for user space.

And what if we'd like to enable THP for a specific set of tasks that are
unrelated (in terms of parent-child)?

IMHO, an alternative approach would be changing the per-process THP mode by
PID, which would also avoid any user app changes. But that kind of interface
doesn't exist yet. In any case, it would require maintaining a set of PIDs for
a specific group of processes, which is also extra work for a sysadmin.

>
> You have not really answered a more fundamental question though. Why the
> THP behavior should be at the cgroup scope? From a practical POV that
> would represent containers which are a mixed bag of applications to
> support the workload. Why does the same THP policy apply to all of them?

For THP there are 3 possible levels of fine-grained control:
- global THP
- THP per group of processes
- THP per process

I agree that, in a container, different apps might have different THP
requirements. But it also depends on many factors, such as the container
"size" (tiny/huge container) and the diversity of apps/functions inside the
container. I mean that in some cases we might not need to go below the
"per-group" level of THP control.

>
> Doesn't this make the sub-optimal global behavior the same on the cgroup
> level when some parts will benefit while others will not?
>

I think the key idea behind the sub-optimal behavior is "predictability", so
that we know for sure which apps/services will consume THPs.
We observed significant THP usage on an almost idle Ubuntu server with a
simple test running (some random system services consumed a few hundred MB of
THPs). Of course, on other distros we might see a different situation. But
with fine-grained per-group control it's a lot more predictable.

Did I understand your question correctly?


>> Moreover we're using file-backed THPs too (for .text mostly), which make it for
>> user-space developers even more complicated.
>>
>>>>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>>>>>> cgroup mode changes aren't propagated to child cgroups.
>>>>>
>>>>> So this breaks hierarchical property, doesn't it? In other words if a
>>>>> parent cgroup would like to enforce a certain policy to all descendants
>>>>> then this is not really possible. 
>>>>
>>>> The first idea was to have some flexibility when changing THP policies. 
>>>>
>>>> I will submit a new patch set which will enforce the cgroup hierarchy and change all
>>>> the children recursively.
>>>
>>> What is the expected semantics then?
>>
>> 2nd point (on semantics)
>> 1. Children inherit the THP policy upon creation
>> 2. Parent's policy changes are propagated to all the children
>> 3. Children can set the policy independently

>
> So if the parent decides that none of the children should be using THP
> they can override that so the tuning at parent has no imperative
> control. This is breaking hierarchical property that is expected from
> cgroup control files.

Actually, I think we can solve this. As we mostly need just a single level of
children, the "flat" case (root->child) is enough: interpret the root memcg
THP mode as the "global THP setting", and forbid sub-children from overriding
an inherited THP mode.
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 3 days ago
On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
[...]
> As prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
> it needs app. developer participation anyway.
> In theory, kind of a launcher-process can be used, to utilize the inheritance
> of the corresponding prctl THP setting, but this seems not transparent
> for the user-space.

No, this is not just theory. This is a very common usage pattern that allows
changing the behavior of the target application transparently.

> And what if we'd like to enable THP for a specific set of unrelated (in terms of parent-child)
> tasks?

This is what I've had in mind. Currently we only have THP disable
option. If we really need an override to enforce THP on an application
then this could be a more viable path.

> IMHO, an alternative approach would be changing per-process THP-mode by PID,
> thus also avoiding any user app. changes.

We already have process_madvise. MADV_HUGEPAGE resp. MADV_COLLAPSE are not
supported there, but we can discuss that option of course. This interface
requires much more orchestration because it is VMA-range based.

> > You have not really answered a more fundamental question though. Why the
> > THP behavior should be at the cgroup scope? From a practical POV that
> > would represent containers which are a mixed bag of applications to
> > support the workload. Why does the same THP policy apply to all of them?
> 
> For THP there're 3 possible levels of fine-control:
> - global THP
>   - THP per-group of processes
>      - THP per-process
> 
> I agree, that in a container, different apps might have different
> THP requirements. 
> But it also depends on many factors, such as:
> container "size"(tiny/huge container), diversity of apps/functions inside a container.
> I mean, for some cases, we might not need to go below "per-group" level in terms of THP control.

I am sorry, but I do not really see any argument why this should be per-memcg.
Quite the contrary: having that per-memcg seems more muddy.

> > Doesn't this make the sub-optimal global behavior the same on the cgroup
> > level when some parts will benefit while others will not?
> >
> 
> I think the key idea for the sub-optimal behavior is "predictability",
> so we know for sure which apps/services would consume THPs.

OK, that seems fair.

> We observed a significant THP usage on almost idle Ubuntu server, with simple test running,
> (some random system services consumed few hundreds Mb of THPs).

I assume that you are using "always" as the global default configuration,
right? If that is the case, then high (in fact, as high as feasible) THP
utilization is the actual goal. If you want more targeted THP use, then
madvise is what you are looking for. This will not help applications which are
not THP-aware, of course, but then we are back to the discussion of whether
the interface should be a) per process, b) per cgroup, or c) process_madvise.

> Of course, on other distros me might have different situation.
> But with fine-grained per-group control it's a lot more predictable.
> 
> Am i got you question right? 

Not really, but at least I (hopefully) understand that you are trying to work
around THP overuse by making the global default more restrictive while making
some workloads less restrictive. The question of why pushing that down to the
memcg scope makes the situation better is not answered, AFAICT.

[...]
> > So if the parent decides that none of the children should be using THP
> > they can override that so the tuning at parent has no imperative
> > control. This is breaking hierarchical property that is expected from
> > cgroup control files.
> 
> Actually, I think we can solve this.
> Since we mostly need just a single level of children, the "flat" case
> (root->child) is enough: interpret the root memcg's THP mode as the
> "global THP setting" and forbid sub-children from overriding an
> inherited THP mode.

This reduced case is not really sufficient to justify the
non-hierarchical semantics, I am afraid. There must be a _really_ strong case
to break this property and even then I am rather skeptical to be honest.
We have been burnt by introducing stuff like memcg.swappiness that
seemed like a good idea initially but backfired with unexpected behavior
to many users.

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 3 days ago
On 10/31/2024 11:33 AM, Michal Hocko wrote:
> On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
> [...]
>> As prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
>> it needs app developer participation anyway.
>> In theory, a kind of launcher process could be used to exploit the
>> inheritance of the corresponding prctl THP setting, but this does not
>> seem transparent to user space.

> 
> No, this is not just in theory. This is a very common usage pattern to
> allow changing the behavior of the target application transparently.
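For reference, the launcher pattern being discussed can be sketched as
follows (a minimal sketch, assuming Linux's PR_SET_THP_DISABLE prctl,
available since 3.15; the helper names are made up for illustration):

```python
# Sketch of the "launcher" pattern: set PR_SET_THP_DISABLE via prctl(2),
# then exec the target program. The per-task flag is inherited across
# fork()/execve(), so the target runs with THP disabled without any
# change to its own code.
import ctypes
import os

PR_SET_THP_DISABLE = 41  # from <linux/prctl.h>
PR_GET_THP_DISABLE = 42

_libc = ctypes.CDLL(None, use_errno=True)

def set_thp_disable(flag):
    """Set (1) or clear (0) the calling task's THP-disable flag."""
    if _libc.prctl(PR_SET_THP_DISABLE, flag, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def thp_disabled():
    """Read the flag back via PR_GET_THP_DISABLE."""
    return _libc.prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0) == 1

def launch_with_thp_disabled(argv):
    """Disable THP for this task, then exec the target, which inherits
    the setting, e.g. launch_with_thp_disabled(["/usr/bin/myapp"])."""
    set_thp_disable(1)
    os.execvp(argv[0], argv)
```

Note that prctl only offers the disable direction; there is no
PR_SET_THP_ENABLE counterpart, which is part of the limitation being
discussed here.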
> 
>> And what if we'd like to enable THP for a specific set of unrelated (in terms of parent-child)
>> tasks?
> 
> This is what I've had in mind. Currently we only have THP disable
> option. If we really need an override to enforce THP on an application
> then this could be a more viable path.
> 
>> IMHO, an alternative approach would be changing the per-process THP mode
>> by PID, thus also avoiding any user-app changes.

> 
> We already have process_madvise. MADV_HUGEPAGE resp. MADV_COLLAPSE are
> not supported there, but we can discuss that option, of course. This
> interface requires much more orchestration because it is VMA-range
> based.
> 
If we consider the inheritance approach (prctl + launcher), it works until
we need to change the THP mode property for several tasks at once; in that
case some batch-change mechanism is needed.

If, for example, process_madvise() supported task-recursive logic, coupled
with a kind of MADV_HUGE + *ITERATE_ALL_VMA*, it would be helpful.
In that case, the orchestration would be much easier.
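To illustrate the shape of the existing interface (a sketch only: the
recursive/iterate-all-VMAs variant mentioned above is hypothetical and
does not exist; the syscall numbers assume x86-64, and MADV_COLLAPSE via
process_madvise() needs Linux 6.1+):

```python
# Sketch of advising another task's memory through process_madvise(2).
# The caller must already know the target's VMA ranges (e.g. parsed from
# /proc/<pid>/maps), which is exactly the orchestration burden noted in
# the discussion.
import ctypes
import os

SYS_pidfd_open = 434       # x86-64 syscall numbers; other arches differ
SYS_process_madvise = 440
MADV_COLD = 20             # accepted by process_madvise since Linux 5.10
MADV_COLLAPSE = 25         # accepted by process_madvise since Linux 6.1

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p),
                ("iov_len", ctypes.c_size_t)]

_libc = ctypes.CDLL(None, use_errno=True)

def advise(pid, addr, length, advice):
    """Apply `advice` to [addr, addr+length) in the target process and
    return the number of bytes the kernel actually advised."""
    pidfd = _libc.syscall(SYS_pidfd_open, pid, 0)
    if pidfd < 0:
        raise OSError(ctypes.get_errno(), "pidfd_open failed")
    try:
        iov = iovec(addr, length)
        ret = _libc.syscall(SYS_process_madvise, pidfd,
                            ctypes.byref(iov), 1, advice, 0)
        if ret < 0:
            raise OSError(ctypes.get_errno(), "process_madvise failed")
        return ret
    finally:
        os.close(pidfd)
```

A batch change over many tasks would mean repeating this per PID and per
VMA range, which is why a recursive or whole-address-space variant would
simplify the orchestration considerably.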

>>> You have not really answered a more fundamental question though. Why the
>>> THP behavior should be at the cgroup scope? From a practical POV that
>>> would represent containers which are a mixed bag of applications to
>>> support the workload. Why does the same THP policy apply to all of them?
>>
>> For THP there're 3 possible levels of fine-control:
>> - global THP
>>   - THP per-group of processes
>>      - THP per-process
>>
>> I agree, that in a container, different apps might have different
>> THP requirements. 
>> But it also depends on many factors, such as:
>> container "size"(tiny/huge container), diversity of apps/functions inside a container.
>> I mean, for some cases, we might not need to go below "per-group" level in terms of THP control.
> 
> I am sorry but I do not really see any argument why this should be
> per-memcg. Quite contrary. having that per memcg seems more muddy.
> 
>>> Doesn't this make the sub-optimal global behavior the same on the cgroup
>>> level when some parts will benefit while others will not?
>>>
>>
>> I think the key idea for the sub-optimal behavior is "predictability",
>> so we know for sure which apps/services would consume THPs.
> 
> OK, that seems fair.
> 
>> We observed a significant THP usage on almost idle Ubuntu server, with simple test running,
>> (some random system services consumed few hundreds Mb of THPs).
> 
> I assume that you are using Always as global default configuration,
> right? If that is the case then the high (in fact as high as feasible)
> THP utilization is a real goal. If you want more targeted THP use then
> madvise is what you are looking for. This will not help applications
> which are not THP aware of course but then we are back to the discussion
> whether the interface should be per a) per process b) per cgroup c)
> process_madvise.
> 
>> Of course, on other distros me might have different situation.
>> But with fine-grained per-group control it's a lot more predictable.
>>
>> Am i got you question right? 

> 
> Not really but at least I do understand (hopefully) that you are trying
> to workaround THP overuse by changing the global default to be more
> restrictive while some workloads to be less restrictive. The question
> why pushing that down to memcg scope makes the situation better is not
> answered AFAICT.
>
Don't get us wrong, we're not trying to push this into memcg specifically.
We're just trying to find a proper/friendly way to control the
THP mode for a group of processes (which can be tasks without a common parent).

Maybe if the process-grouping logic were decoupled from the hierarchical
resource-control logic, it would be possible to gather multiple processes
and batch-control some task properties. But that would require building a
kind of task-properties system, where a given set of properties can be
flexibly assigned to one or more tasks.

Anyway, I think we are going to try the alternative
approaches (prctl, process_madvise) first.
 
> [...]
>>> So if the parent decides that none of the children should be using THP
>>> they can override that so the tuning at parent has no imperative
>>> control. This is breaking hierarchical property that is expected from
>>> cgroup control files.
>>
>> Actually, i think we can solve this.
>> As we mostly need just a single children level,
>> "flat" case (root->child) is enough, interpreting root-memcg THP mode as "global THP setting",
>> where sub-children are forbidden to override an inherited THP-mode.

> 
> This reduced case is not really sufficient to justify the non
> hiearchical semantic, I am afraid. There must be a _really_ strong case
> to break this property and even then I am rather skeptical to be honest.
> We have been burnt by introducing stuff like memcg.swappiness that
> seemed like a good idea initially but backfired with unexpected behavior
> to many users.
> 

-- 
Anatoly Stepanov, Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Matthew Wilcox 3 weeks, 2 days ago
On Thu, Oct 31, 2024 at 05:37:12PM +0300, Stepanov Anatoly wrote:
> Don't get us wrong, we're not trying to push this into memcg specifically.
> We're just trying to find a proper/friendly way to control the
> THP mode for a group of processes (which can be tasks without a common parent).
> 
> Maybe if the process-grouping logic were decoupled from the hierarchical
> resource-control logic, it would be possible to gather multiple processes
> and batch-control some task properties. But that would require building a
> kind of task-properties system, where a given set of properties can be
> flexibly assigned to one or more tasks.
> 
> Anyway, I think we are going to try the alternative
> approaches (prctl, process_madvise) first.

I oppose all of these approaches.  They are fundamentally misguided.
You're trying to blame sysadmins for our inadequacies as programmers.
All of this should be automatic.  Certainly the kernel will make mistakes
and not use the perfectly optimal size at all times, but it should be able
to get close to optimal.  Please, focus your efforts on allocating memory
of the right size, not on this fake problem of "we only have 235 THPs
available and we must make sure that the right process gets 183 of them".
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 2 days ago
On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> If we consider the inheritance approach (prctl + launcher), it works until
> we need to change the THP mode property for several tasks at once; in that
> case some batch-change mechanism is needed.

I do not follow. How is this any different from a single process? Or do
you mean to change the mode for an already running process?

> If, for example, process_madvise() supported task-recursive logic, coupled
> with a kind of MADV_HUGE + *ITERATE_ALL_VMA*, it would be helpful.
> In that case, the orchestration would be much easier.

Nope, process_madvise is a pidfd-based interface, and making it recursive
seems simply impossible for most operations, as the address space is very
likely different in each child process.


-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 2 days ago
On 11/1/2024 10:35 AM, Michal Hocko wrote:
> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>> THP mode property for several tasks at once, in this case some batch-change approach needed.
> 
> I do not follow. How is this any different from a single process? Or do
> you mean to change the mode for an already running process?
> 
Yes, for an already-running set of processes.


-- 
Anatoly Stepanov, Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 2 days ago
On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> On 11/1/2024 10:35 AM, Michal Hocko wrote:
> > On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> >> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
> >> THP mode property for several tasks at once, in this case some batch-change approach needed.
> > 
> > I do not follow. How is this any different from a single process? Or do
> > you mean to change the mode for an already running process?
> > 
> yes, for already running set of processes

Why is that preferred over setting the policy upfront?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 2 days ago
On 11/1/2024 4:15 PM, Michal Hocko wrote:
> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>>>
>>> I do not follow. How is this any different from a single process? Or do
>>> you mean to change the mode for an already running process?
>>>
>> yes, for already running set of processes
> 

> Why is that preferred over setting the policy upfront?
Setting the policy in advance is fine as a first step.
But we might not know in advance which exact policy is the most
beneficial for one set of apps or another.

So I think it's better to have the ability to change the policy
on the fly, without an app/service restart.
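With the memory.thp_enabled file this RFC proposes (not present in
mainline kernels), such an on-the-fly change would amount to writing the
cgroup's control file; a sketch, with the file format taken from the
cover letter:

```python
# Sketch of flipping the THP policy for a running group of tasks via the
# per-cgroup memory.thp_enabled file proposed by this RFC. Every task in
# the cgroup would be affected without a restart. The file and its
# "always [madvise] never" format mirror the global sysfs knob.
import os

VALID_MODES = ("always", "madvise", "never")

def set_cgroup_thp(cgroup_path, mode):
    """Write `mode` to <cgroup>/memory.thp_enabled."""
    if mode not in VALID_MODES:
        raise ValueError("unknown THP mode: %s" % mode)
    with open(os.path.join(cgroup_path, "memory.thp_enabled"), "w") as f:
        f.write(mode + "\n")

def get_cgroup_thp(cgroup_path):
    """Return the selected mode, e.g. 'always [madvise] never' -> 'madvise'."""
    with open(os.path.join(cgroup_path, "memory.thp_enabled")) as f:
        line = f.read()
    return line.split("[", 1)[1].split("]", 1)[0]
```

For example, set_cgroup_thp("/sys/fs/cgroup/test", "always") would mirror
the `echo always > /sys/fs/cgroup/test/memory.thp_enabled` usage from the
cover letter.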

-- 
Anatoly Stepanov, Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 2 days ago
On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
> On 11/1/2024 4:15 PM, Michal Hocko wrote:
> > On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> >> On 11/1/2024 10:35 AM, Michal Hocko wrote:
> >>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> >>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
> >>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
> >>>
> >>> I do not follow. How is this any different from a single process? Or do
> >>> you mean to change the mode for an already running process?
> >>>
> >> yes, for already running set of processes
> > 
> 
> > Why is that preferred over setting the policy upfront?
> Setting the policy in advance is fine, as the first step to do.
> But we might not know in advance
> which exact policy is the most beneficial for one set of apps or another.

How do you plan to find that out when the application is running
already?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 2 days ago
On 11/1/2024 4:28 PM, Michal Hocko wrote:
> On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
>> On 11/1/2024 4:15 PM, Michal Hocko wrote:
>>> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>>>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>>>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>>>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>>>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>>>>>
>>>>> I do not follow. How is this any different from a single process? Or do
>>>>> you mean to change the mode for an already running process?
>>>>>
>>>> yes, for already running set of processes
>>>
>>
>>> Why is that preferred over setting the policy upfront?
>> Setting the policy in advance is fine, as the first step to do.
>> But we might not know in advance
>> which exact policy is the most beneficial for one set of apps or another.

> 
> How do you plan to find that out when the application is running
> already?
For example, someone might want to compare a DB server's performance with
THP off vs. THP on, when a DB server restart isn't an option.
Of course, if a restart is OK, then we don't need such a feature; the
"launcher" approach would be enough. That is, if I got your question right.


-- 
Anatoly Stepanov, Huawei
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Michal Hocko 3 weeks, 2 days ago
On Fri 01-11-24 16:39:07, Stepanov Anatoly wrote:
> On 11/1/2024 4:28 PM, Michal Hocko wrote:
> > On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
> >> On 11/1/2024 4:15 PM, Michal Hocko wrote:
> >>> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> >>>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
> >>>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> >>>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
> >>>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
> >>>>>
> >>>>> I do not follow. How is this any different from a single process? Or do
> >>>>> you mean to change the mode for an already running process?
> >>>>>
> >>>> yes, for already running set of processes
> >>>
> >>
> >>> Why is that preferred over setting the policy upfront?
> >> Setting the policy in advance is fine, as the first step to do.
> >> But we might not know in advance
> >> which exact policy is the most beneficial for one set of apps or another.
> 
> > 
> > How do you plan to find that out when the application is running
> > already?
> For example, if someone willing to compare some DB server performance with THP-off vs THP-on,
> and DB server restart isn't an option.

So you essentially expect the user to tell you that they want THP, and you
want to make that happen on the fly, correct? It is not like there is
actual monitoring and dynamic policy enforcement.

If that is the case then I am not really convinced this is worthwhile to
support, TBH. I can see that a workload may know in advance that it
benefits from THP, but I am much more dubious that "learning during
runtime" is a real-life thing. I might be wrong, of course, but if
somebody has performance monitoring that is able to identify performance
bottlenecks in a specific workload, then applying THP to the whole
group of processes seems like a very crude way to deal with that. I
could see a case for process_madvise(MADV_COLLAPSE) to deal with
specific memory hotspots, though.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] Cgroup-based THP control
Posted by Stepanov Anatoly 3 weeks, 2 days ago
On 11/1/2024 4:50 PM, Michal Hocko wrote:
> On Fri 01-11-24 16:39:07, Stepanov Anatoly wrote:
>> On 11/1/2024 4:28 PM, Michal Hocko wrote:
>>> On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
>>>> On 11/1/2024 4:15 PM, Michal Hocko wrote:
>>>>> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>>>>>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>>>>>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>>>>>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>>>>>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>>>>>>>
>>>>>>> I do not follow. How is this any different from a single process? Or do
>>>>>>> you mean to change the mode for an already running process?
>>>>>>>
>>>>>> yes, for already running set of processes
>>>>>
>>>>
>>>>> Why is that preferred over setting the policy upfront?
>>>> Setting the policy in advance is fine, as the first step to do.
>>>> But we might not know in advance
>>>> which exact policy is the most beneficial for one set of apps or another.
>>
>>>
>>> How do you plan to find that out when the application is running
>>> already?
>> For example, if someone willing to compare some DB server performance with THP-off vs THP-on,
>> and DB server restart isn't an option.

> 
> So you essentially expect user tell you that they want THP and you want
> to make that happen on fly, correct? It is not like there is an actual
> monitoring and dynamic policing.
For a user/sysadmin this scenario is almost the same as experimenting with
the global THP settings, but with explicit THP usage and less THP overuse
by random apps, so it is more predictable.

> 
> If that is the case then I am not really convinced this is worthwhile to
> support, TBH. I can see that a workload may know in advance that it
> benefits from THP, but I am much more dubious that "learning during
> runtime" is a real-life thing. I might be wrong, of course, but if
> somebody has performance monitoring that is able to identify performance
> bottlenecks in a specific workload, then applying THP to the whole
> group of processes seems like a very crude way to deal with that. I
> could see a case for process_madvise(MADV_COLLAPSE) to deal with
> specific memory hotspots, though.
Yes, we have something like this in mind.

-- 
Anatoly Stepanov, Huawei