From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>

Currently THP modes are set globally. This can be overkill if only a
specific app or set of apps needs to benefit from THP usage. Moreover,
different apps might need different THP settings. Here we propose a
cgroup-based THP control mechanism.

A THP interface is added to the memory cgroup subsystem. The existing
global THP control semantics are supported for backward compatibility.
When THP modes are set globally, the changes are propagated to all
memory cgroups. However, when a particular cgroup changes its THP
policy, the global THP policy in sysfs remains the same.

New memcg files are exposed: memory.thp_enabled and memory.thp_defrag,
which have exactly the same format as the global THP enabled/defrag
files.

Child cgroups inherit THP settings from their parent cgroup upon
creation. Mode changes in a particular cgroup are not propagated to its
child cgroups.

During the memory cgroup attachment stage, the corresponding slots are
added to or removed from khugepaged according to the THP policy.

Usage examples:

Set "madvise" mode globally:

# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

All the settings are propagated:

# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never
# cat /sys/fs/cgroup/test/memory.thp_enabled
always [madvise] never

Set "always" for a specific cgroup:

# echo always > /sys/fs/cgroup/test/memory.thp_enabled
# cat /sys/fs/cgroup/test/memory.thp_enabled
[always] madvise never

The root cgroup remains in "madvise" mode:

# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never

When attempting to read the global setting we get a "mixed state"
warning, as the THP mode is no longer the same for every cgroup:

# cat /sys/kernel/mm/transparent_hugepage/enabled
Mixed state: see particular memcg flags!
Again, set the THP mode globally and make sure everything works fine:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
# cat /sys/fs/cgroup/memory.thp_enabled
always madvise [never]
# cat /sys/fs/cgroup/test/memory.thp_enabled
always madvise [never]

Here is a simple demo with a test which does an anonymous mmap() and a
series of random reads. The system is rebooted between the cases.

Case 1: Global THP - always. No cgroup.

// Global THP stats:
AnonHugePages:    391168 kB
FileHugePages:    120832 kB
FilePmdMapped:     67584 kB

// THP stats from *smaps* of the testing process
AnonHugePages:     12288 kB

Case 2: Global THP - never. Cgroup - always.

// Global THP stats:
AnonHugePages:     12288 kB
FileHugePages:      2048 kB
FilePmdMapped:      2048 kB

// THP stats from *smaps* of the testing process
AnonHugePages:     12288 kB

// The cgroup THP stats
anon_thp 12582912
file_thp 2097152

Obviously there is a huge difference between the two in terms of global
THP usage, showing that the cgroup approach is beneficial for cases
where a specific app or set of apps needs THP without any changes to
the application code.

TODO list:
1. Anonymous mTHP
2. Fine-grained mode selection for different VMA types
   ("anon|exec|ro|file"), to be able to support combinations such as
   "always + exec", "always + anon", etc.
3.
   Per-cgroup limit for THP usage

Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>

Asier Gutierrez, Anatoly Stepanov (3):
  mm: Add thp_flags control for cgroup
  mm: Support for huge pages in cgroups
  mm: Add thp_defrag control for cgroup

 include/linux/huge_mm.h    |  23 +++-
 include/linux/khugepaged.h |   2 +-
 include/linux/memcontrol.h |  28 ++++
 mm/huge_memory.c           | 207 ++++++++++++++++++-----------
 mm/khugepaged.c            |   8 +-
 mm/memcontrol.c            | 262 +++++++++++++++++++++++++++++++++++++
 6 files changed, 449 insertions(+), 81 deletions(-)

-- 
2.34.1
gutierrez.asier@huawei-partners.com writes:
> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.

cgroup controls exist because there are things we want to do for an
entire class of processes (group OOM, resource control, etc.). Enabling
or disabling some specific setting is generally not one of them, which
is why we got rid of things like per-cgroup vm.swappiness. We know that
these controls do not compose well and have caused a lot of pain in the
past. So my immediate reaction is a nack on the general concept, unless
there is some absolutely compelling case here.

I talked a little at Kernel Recipes last year about moving away from
sysctl and other global interfaces and making things more granular.
Don't get me wrong, I think that is a good thing (although, of course,
a very large undertaking) -- but it is a mistake to overload the amount
of controls we expose as part of the cgroup interface.

I am up for thinking about how we can improve the state of global
tunables to make them more granular overall, but this can't set a
precedent as the way to do it.
On Wed 30-10-24 14:45:24, Chris Down wrote:
> gutierrez.asier@huawei-partners.com writes:
> > New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> > have completely the same format as global THP enabled/defrag.
>
> cgroup controls exist because there are things we want to do for an entire
> class of processes (group OOM, resource control, etc). Enabling or disabling
> some specific setting is generally not one of them, hence why we got rid of
> things like per-cgroup vm.swappiness. We know that these controls do not
> compose well and have caused a lot of pain in the past. So my immediate
> reaction is a nack on the general concept, unless there's some absolutely
> compelling case here.
>
> I talked a little at Kernel Recipes last year about moving away from sysctl
> and other global interfaces and making things more granular. Don't get me
> wrong, I think that is a good thing (although, of course, a very large
> undertaking) -- but it is a mistake to overload the amount of controls we
> expose as part of the cgroup interface.

Completely agreed!

-- 
Michal Hocko
SUSE Labs
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.

Or maybe we should stop making the sysadmin's life so damned hard and
figure out how to do without all of these settings?
On 30.10.24 14:14, Matthew Wilcox wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>
> Or maybe we should stop making the sysadmin's life so damned hard and
> figure out how to do without all of these settings?

In particular if there is no proper problem description / use case.

-- 
Cheers,

David / dhildenb
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
>
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.
>
> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
>
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.

Cgroups are for hierarchical resource distribution. It's tempting to
add parameters you would want for flat collections of processes, but it
gets weird when it comes to inheritance and hierarchical semantics
inside the cgroup tree - like it does here. So this is not a good fit.

On this particular issue, I agree with Willy and David: let's not
proliferate THP knobs; let's focus on making them truly transparent.
On 10/30/2024 6:08 PM, Johannes Weiner wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> [...]
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.
>
> Cgroups are for hierarchical resource distribution. It's tempting to
> add parameters you would want for flat collections of processes, but it
> gets weird when it comes to inheritance and hierarchical semantics
> inside the cgroup tree - like it does here. So this is not a good fit.
>
> On this particular issue, I agree with Willy and David: let's not
> proliferate THP knobs; let's focus on making them truly transparent.

We are also thinking about a THP-limit direction (as mentioned in the
cover letter): keep only the global THP knobs, add a per-cgroup THP
limit, and introduce a couple of additional global modes
(always-cgroup / madvise-cgroup). "always-cgroup", for instance, would
enable THP for those tasks which are attached to a non-root memcg.

A per-cgroup THP limit could be used in combination with the global THP
knobs, and in this way we can maintain hierarchical semantics.

For now this is just an idea; maybe it is better to have another RFC
patch set for it, to allow a more productive conversation.

-- 
Anatoly Stepanov, Huawei
On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
>
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.

Do you have any specific examples where this would be beneficial?

> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
>
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.

So this breaks the hierarchical property, doesn't it? In other words,
if a parent cgroup would like to enforce a certain policy on all
descendants, that is not really possible.

-- 
Michal Hocko
SUSE Labs
On 10/30/2024 11:38 AM, Michal Hocko wrote:
> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> [...]
>
> Do you have any specific examples where this would be beneficial?

Right now we are mostly focused on database scenarios (MySQL, Redis).
The main idea is to avoid a global THP setting that can potentially
waste resources overall, and to have per-cgroup granularity instead.

Besides THP being beneficial for DB performance, we observe high THP
"over-usage" by some unrelated apps/services when "always" mode is
enabled globally. With cgroup-THP we are able to specify the exact
"THP users", and we plan to introduce the ability to limit the amount
of THPs per cgroup.

We suppose it should also be beneficial for some container-based
workloads, where certain containers can have different THP policies,
but we have not looked into this case yet.

>> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>> have completely the same format as global THP enabled/defrag.
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.
>
> So this breaks the hierarchical property, doesn't it? In other words,
> if a parent cgroup would like to enforce a certain policy on all
> descendants, that is not really possible.

The first idea was to have some flexibility when changing THP policies.

I will submit a new patch set which will enforce the cgroup hierarchy
and change all the children recursively.

-- 
Asier Gutierrez
Huawei
On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> > On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >> [...]
> >
> > Do you have any specific examples where this would be beneficial?
>
> Now we're mostly focused on database scenarios (MySQL, Redis).

That seems to be more process than workload oriented. Why doesn't the
existing per-process tuning work?

[...]

> >> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >> cgroup mode changes aren't propagated to child cgroups.
> >
> > So this breaks the hierarchical property, doesn't it? In other words,
> > if a parent cgroup would like to enforce a certain policy on all
> > descendants, that is not really possible.
>
> The first idea was to have some flexibility when changing THP policies.
>
> I will submit a new patch set which will enforce the cgroup hierarchy
> and change all the children recursively.

What is the expected semantics then?

-- 
Michal Hocko
SUSE Labs
On 10/30/2024 4:27 PM, Michal Hocko wrote:
> On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>> On 10/30/2024 11:38 AM, Michal Hocko wrote:
>>> [...]
>>>
>>> Do you have any specific examples where this would be beneficial?
>>
>> Now we're mostly focused on database scenarios (MySQL, Redis).
>
> That seems to be more process than workload oriented. Why doesn't the
> existing per-process tuning work?
>
> [...]

1st point

We are trying to provide a transparent mechanism, but all the existing
per-process methods require modifying the app itself (MADV_HUGEPAGE,
MADV_COLLAPSE, hugetlbfs).

Moreover, we are using file-backed THPs too (for .text mostly), which
makes it even more complicated for user-space developers.

>> The first idea was to have some flexibility when changing THP policies.
>>
>> I will submit a new patch set which will enforce the cgroup hierarchy
>> and change all the children recursively.
>
> What is the expected semantics then?

2nd point (on semantics)

1. Children inherit the THP policy upon creation
2. Parent's policy changes are propagated to all the children
3. Children can set the policy independently

-- 
Asier Gutierrez
Huawei
On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
> On 10/30/2024 4:27 PM, Michal Hocko wrote:
> > [...]
> >
> > That seems to be more process than workload oriented. Why doesn't the
> > existing per-process tuning work?
>
> 1st point
>
> We are trying to provide a transparent mechanism, but all the existing
> per-process methods require modifying the app itself (MADV_HUGEPAGE,
> MADV_COLLAPSE, hugetlbfs).

There is also prctl to define a per-process policy. We currently have
means to disable THP for a process to override the default behavior.
That would be mostly transparent for the application.

You have not really answered a more fundamental question though. Why
should the THP behavior be at the cgroup scope? From a practical POV
that would represent containers, which are a mixed bag of applications
supporting the workload. Why does the same THP policy apply to all of
them? Doesn't this make the sub-optimal global behavior the same on the
cgroup level, when some parts will benefit while others will not?

> Moreover, we are using file-backed THPs too (for .text mostly), which
> makes it even more complicated for user-space developers.
>
> [...]
>
> 2nd point (on semantics)
> 1. Children inherit the THP policy upon creation
> 2. Parent's policy changes are propagated to all the children
> 3. Children can set the policy independently

So if the parent decides that none of the children should be using THP,
the children can override that, and the tuning at the parent has no
imperative control. This breaks the hierarchical property that is
expected from cgroup control files.

-- 
Michal Hocko
SUSE Labs
On 10/30/2024 6:15 PM, Michal Hocko wrote:
> On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
>> [...]
>>
>> We are trying to provide a transparent mechanism, but all the existing
>> per-process methods require modifying the app itself (MADV_HUGEPAGE,
>> MADV_COLLAPSE, hugetlbfs).
>
> There is also prctl to define a per-process policy. We currently have
> means to disable THP for a process to override the default behavior.
> That would be mostly transparent for the application.

(Answering as a co-author of the feature.)

As prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
it needs app developer participation anyway. In theory, a kind of
launcher process can be used, to utilize the inheritance of the
corresponding prctl THP setting, but this does not seem transparent for
user space. And what if we would like to enable THP for a specific set
of unrelated (in terms of parent-child) tasks?

IMHO, an alternative approach would be changing the per-process THP
mode by PID, thus also avoiding any user app changes. But that kind of
thing does not exist yet. Anyway, it would require maintaining a set of
PIDs for a specific group of processes, which is also extra work for a
sysadmin.

> You have not really answered a more fundamental question though. Why
> should the THP behavior be at the cgroup scope? From a practical POV
> that would represent containers, which are a mixed bag of applications
> supporting the workload. Why does the same THP policy apply to all of
> them?

For THP there are three possible levels of fine control:
- global THP
- THP per group of processes
- THP per process

I agree that, in a container, different apps might have different THP
requirements. But it also depends on many factors, such as the
container "size" (tiny/huge container) and the diversity of
apps/functions inside a container. I mean, for some cases we might not
need to go below the "per-group" level in terms of THP control.

> Doesn't this make the sub-optimal global behavior the same on the
> cgroup level, when some parts will benefit while others will not?

I think the key idea for the sub-optimal behavior is "predictability",
so we know for sure which apps/services would consume THPs. We observed
significant THP usage on an almost idle Ubuntu server with a simple
test running (some random system services consumed a few hundred MB of
THPs). Of course, on other distros we might have a different situation.
But with fine-grained per-group control it is a lot more predictable.

Did I get your question right?

> So if the parent decides that none of the children should be using THP,
> the children can override that, and the tuning at the parent has no
> imperative control. This breaks the hierarchical property that is
> expected from cgroup control files.

Actually, I think we can solve this. As we mostly need just a single
level of children, the "flat" case (root->child) is enough,
interpreting the root-memcg THP mode as the "global THP setting", where
sub-children are forbidden to override an inherited THP mode.
On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
[...]
> As prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
> it needs app developer participation anyway. In theory, a kind of
> launcher process can be used, to utilize the inheritance of the
> corresponding prctl THP setting, but this does not seem transparent
> for user space.

No, this is not just in theory. This is a very common usage pattern to
allow changing the behavior for the target application transparently.

> And what if we would like to enable THP for a specific set of
> unrelated (in terms of parent-child) tasks?

This is what I have had in mind. Currently we only have a THP disable
option. If we really need an override to enforce THP on an application
then this could be a more viable path.

> IMHO, an alternative approach would be changing the per-process THP
> mode by PID, thus also avoiding any user app changes.

We already have process_madvise. MADV_HUGEPAGE resp. MADV_COLLAPSE are
not supported, but we can discuss that option of course. This interface
requires much more orchestration, of course, because it is VMA-range
based.

> > You have not really answered a more fundamental question though. Why
> > should the THP behavior be at the cgroup scope? [...]
>
> For THP there are three possible levels of fine control:
> - global THP
> - THP per group of processes
> - THP per process
>
> I agree that, in a container, different apps might have different THP
> requirements. [...] I mean, for some cases we might not need to go
> below the "per-group" level in terms of THP control.

I am sorry, but I do not really see any argument why this should be
per-memcg. Quite the contrary: having it per-memcg seems more muddy.

> > Doesn't this make the sub-optimal global behavior the same on the
> > cgroup level, when some parts will benefit while others will not?
>
> I think the key idea for the sub-optimal behavior is "predictability",
> so we know for sure which apps/services would consume THPs.

OK, that seems fair.

> We observed significant THP usage on an almost idle Ubuntu server with
> a simple test running (some random system services consumed a few
> hundred MB of THPs).

I assume that you are using "always" as the global default
configuration, right? If that is the case then high (in fact, as high
as feasible) THP utilization is the stated goal. If you want more
targeted THP use then madvise is what you are looking for. This will
not help applications which are not THP aware, of course, but then we
are back to the discussion of whether the interface should be a) per
process, b) per cgroup, or c) process_madvise.

> Of course, on other distros we might have a different situation. But
> with fine-grained per-group control it is a lot more predictable.
>
> Did I get your question right?

Not really, but at least I do understand (hopefully) that you are
trying to work around THP overuse by making the global default more
restrictive while letting some workloads be less restrictive. The
question of why pushing that down to the memcg scope makes the
situation better is not answered, AFAICT.

[...]

> Actually, I think we can solve this. As we mostly need just a single
> level of children, the "flat" case (root->child) is enough,
> interpreting the root-memcg THP mode as the "global THP setting",
> where sub-children are forbidden to override an inherited THP mode.

This reduced case is not really sufficient to justify the
non-hierarchical semantics, I am afraid. There must be a _really_
strong case to break this property, and even then I am rather
skeptical, to be honest. We have been burnt by introducing things like
memcg.swappiness that seemed like a good idea initially but backfired
with unexpected behavior for many users.

-- 
Michal Hocko
SUSE Labs
On 10/31/2024 11:33 AM, Michal Hocko wrote:
> On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
> [...]
>> As prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
>> it needs app. developer participation anyway.
>> In theory, kind of a launcher-process can be used, to utilize the
>> inheritance of the corresponding prctl THP setting, but this seems not
>> transparent for the user-space.
>
> No, this is not in theory. This is a very common usage pattern to allow
> changing the behavior for the target application transparently.
>
>> And what if we'd like to enable THP for a specific set of unrelated
>> (in terms of parent-child) tasks?
>
> This is what I've had in mind. Currently we only have THP disable
> option. If we really need an override to enforce THP on an application
> then this could be a more viable path.
>
>> IMHO, an alternative approach would be changing per-process THP-mode by
>> PID, thus also avoiding any user app. changes.
>
> We already have process_madvise. MADV_HUGEPAGE resp. MADV_COLLAPSE are
> not supported but we can discuss that option of course. This interface
> requires much more orchestration of course because it is VMA range
> based.

If we consider the inheritance approach (prctl + launcher), it's fine until
we need to change the THP mode for several tasks at once; in that case some
batch-change approach is needed.

If, for example, process_madvise() supported task-recursive logic, coupled
with something like MADV_HUGE + *ITERATE_ALL_VMA*, it would be helpful.
In that case, the orchestration would be much easier.

>>> You have not really answered a more fundamental question though. Why the
>>> THP behavior should be at the cgroup scope? From a practical POV that
>>> would represent containers which are a mixed bag of applications to
>>> support the workload. Why does the same THP policy apply to all of them?
>>
>> For THP there're 3 possible levels of fine-control:
>> - global THP
>> - THP per-group of processes
>> - THP per-process
>>
>> I agree that in a container, different apps might have different
>> THP requirements.
>> But it also depends on many factors, such as: container "size"
>> (tiny/huge container), diversity of apps/functions inside a container.
>> I mean, for some cases, we might not need to go below "per-group" level
>> in terms of THP control.
>
> I am sorry but I do not really see any argument why this should be
> per-memcg. Quite contrary, having that per memcg seems more muddy.
>
>>> Doesn't this make the sub-optimal global behavior the same on the cgroup
>>> level when some parts will benefit while others will not?
>>
>> I think the key idea for the sub-optimal behavior is "predictability",
>> so we know for sure which apps/services would consume THPs.
>
> OK, that seems fair.
>
>> We observed significant THP usage on an almost idle Ubuntu server with a
>> simple test running (some random system services consumed a few hundred
>> MB of THPs).
>
> I assume that you are using Always as global default configuration,
> right? If that is the case then the high (in fact as high as feasible)
> THP utilization is a real goal. If you want more targeted THP use then
> madvise is what you are looking for. This will not help applications
> which are not THP aware of course but then we are back to the discussion
> whether the interface should be a) per process b) per cgroup c)
> process_madvise.
>
>> Of course, on other distros we might have a different situation.
>> But with fine-grained per-group control it's a lot more predictable.
>>
>> Did I get your question right?
>
> Not really, but at least I do understand (hopefully) that you are trying
> to work around THP overuse by changing the global default to be more
> restrictive while allowing some workloads to be less restrictive. The
> question why pushing that down to memcg scope makes the situation better
> is not answered AFAICT.

Don't get us wrong, we're not trying to push this into memcg specifically.
We're just trying to find a proper/friendly way to control the THP mode for
a group of processes (which can be tasks without a common parent).

Maybe if the process grouping logic were decoupled from the hierarchical
resource control logic, it would be possible to gather multiple processes
and batch-control some task properties. But that would require building a
kind of task-properties system, where a given set of properties can be
flexibly assigned to one or more tasks.

Anyway, I think we are going to try the alternative approaches first
(prctl, process_madvise).

> [...]
>>> So if the parent decides that none of the children should be using THP
>>> they can override that so the tuning at parent has no imperative
>>> control. This is breaking hierarchical property that is expected from
>>> cgroup control files.
>>
>> Actually, I think we can solve this.
>> As we mostly need just a single child level, the "flat" case
>> (root->child) is enough, interpreting the root-memcg THP mode as the
>> "global THP setting", where sub-children are forbidden to override an
>> inherited THP mode.
>
> This reduced case is not really sufficient to justify the non
> hierarchical semantic, I am afraid. There must be a _really_ strong case
> to break this property and even then I am rather skeptical to be honest.
> We have been burnt by introducing stuff like memcg.swappiness that
> seemed like a good idea initially but backfired with unexpected behavior
> to many users.

--
Anatoly Stepanov, Huawei
On Thu, Oct 31, 2024 at 05:37:12PM +0300, Stepanov Anatoly wrote:
> Don't get us wrong, we're not trying to push this into memcg specifically.
> We're just trying to find a proper/friendly way to control the THP mode
> for a group of processes (which can be tasks without a common parent).
>
> Maybe if the process grouping logic were decoupled from the hierarchical
> resource control logic, it would be possible to gather multiple processes
> and batch-control some task properties. But that would require building a
> kind of task-properties system, where a given set of properties can be
> flexibly assigned to one or more tasks.
>
> Anyway, I think we are going to try the alternative approaches first
> (prctl, process_madvise).

I oppose all of these approaches. They are fundamentally misguided. You're
trying to blame sysadmins for our inadequacies as programmers. All of this
should be automatic.

Certainly the kernel will make mistakes and not use the perfectly optimal
size at all times, but it should be able to get close to optimal. Please,
focus your efforts on allocating memory of the right size, not on this fake
problem of "we only have 235 THPs available and we must make sure that the
right process gets 183 of them".
On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> If we consider the inheritance approach (prctl + launcher), it's fine
> until we need to change the THP mode for several tasks at once; in that
> case some batch-change approach is needed.

I do not follow. How is this any different from a single process? Or do
you mean to change the mode for an already running process?

> If, for example, process_madvise() supported task-recursive logic,
> coupled with something like MADV_HUGE + *ITERATE_ALL_VMA*, it would be
> helpful. In that case, the orchestration would be much easier.

Nope, process_madvise is a pidfd based interface and making it recursive
seems simply impossible for most operations as the address space is very
likely different in each child process.

--
Michal Hocko
SUSE Labs
On 11/1/2024 10:35 AM, Michal Hocko wrote:
> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>> If we consider the inheritance approach (prctl + launcher), it's fine
>> until we need to change the THP mode for several tasks at once; in that
>> case some batch-change approach is needed.
>
> I do not follow. How is this any different from a single process? Or do
> you mean to change the mode for an already running process?

Yes, for an already running set of processes.

--
Anatoly Stepanov, Huawei
On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>> I do not follow. How is this any different from a single process? Or do
>> you mean to change the mode for an already running process?
>
> Yes, for an already running set of processes.

Why is that preferred over setting the policy upfront?

--
Michal Hocko
SUSE Labs
On 11/1/2024 4:15 PM, Michal Hocko wrote:
> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>> Yes, for an already running set of processes.
>
> Why is that preferred over setting the policy upfront?

Setting the policy in advance is fine as a first step. But we might not
know in advance which exact policy is the most beneficial for one set of
apps or another. So I think it's better to have the ability to change the
policy on the fly, without an app/service restart.

--
Anatoly Stepanov, Huawei
On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
> Setting the policy in advance is fine as a first step. But we might not
> know in advance which exact policy is the most beneficial for one set of
> apps or another.

How do you plan to find that out when the application is running
already?

--
Michal Hocko
SUSE Labs
On 11/1/2024 4:28 PM, Michal Hocko wrote:
> On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
>> Setting the policy in advance is fine as a first step. But we might not
>> know in advance which exact policy is the most beneficial for one set of
>> apps or another.
>
> How do you plan to find that out when the application is running
> already?

For example, someone might want to compare the performance of a DB server
with THP off vs. THP on, when restarting the DB server isn't an option.
Of course, if a restart is OK then we don't need such a feature; the
"launcher" approach would be enough. If I got your question right.

--
Anatoly Stepanov, Huawei
On Fri 01-11-24 16:39:07, Stepanov Anatoly wrote:
> On 11/1/2024 4:28 PM, Michal Hocko wrote:
>> How do you plan to find that out when the application is running
>> already?
>
> For example, someone might want to compare the performance of a DB server
> with THP off vs. THP on, when restarting the DB server isn't an option.

So you essentially expect the user to tell you that they want THP and you
want to make that happen on the fly, correct? It is not like there is
actual monitoring and dynamic policy-making involved.

If that is the case then I am not really convinced this is worthwhile to
support TBH. I can see that a workload may know in advance that it
benefits from THP, but I am much more dubious that "learning during the
runtime" is a real life thing. I might be wrong of course, but if somebody
has performance monitoring that is able to identify performance
bottlenecks in a specific workload, then applying THP to the whole group
of processes seems like a very crude way to deal with that. I could see a
case for process_madvise(MADV_COLLAPSE) to deal with specific memory
hotspots though.

--
Michal Hocko
SUSE Labs
On 11/1/2024 4:50 PM, Michal Hocko wrote:
> On Fri 01-11-24 16:39:07, Stepanov Anatoly wrote:
>> For example, someone might want to compare the performance of a DB server
>> with THP off vs. THP on, when restarting the DB server isn't an option.
>
> So you essentially expect the user to tell you that they want THP and you
> want to make that happen on the fly, correct? It is not like there is
> actual monitoring and dynamic policy-making involved.

For a user/sysadmin this scenario is almost the same as experimenting with
the global THP settings, but with explicit THP usage and less THP overuse
by random apps, so it is more predictable.

> If that is the case then I am not really convinced this is worthwhile to
> support TBH. I can see that a workload may know in advance that it
> benefits from THP, but I am much more dubious that "learning during the
> runtime" is a real life thing. I might be wrong of course, but if somebody
> has performance monitoring that is able to identify performance
> bottlenecks in a specific workload, then applying THP to the whole group
> of processes seems like a very crude way to deal with that. I could see a
> case for process_madvise(MADV_COLLAPSE) to deal with specific memory
> hotspots though.

Yes, we have something like this in mind.

--
Anatoly Stepanov, Huawei