[PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup

Vasily Averin posted 9 patches 3 years, 10 months ago
Only 0 patches received!
fs/kernfs/mount.c      | 6 ++++--
fs/xattr.c             | 2 +-
kernel/cgroup/cgroup.c | 2 +-
kernel/cgroup/rstat.c  | 3 ++-
kernel/sched/fair.c    | 4 ++--
kernel/sched/psi.c     | 2 +-
kernel/sched/rt.c      | 2 +-
mm/memcontrol.c        | 4 ++--
8 files changed, 14 insertions(+), 11 deletions(-)
[PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Vasily Averin 3 years, 10 months ago
In some cases, creating a cgroup allocates a noticeable amount of memory.
This operation can be executed from inside memory-limited container,
but currently this memory is not accounted to memcg and can be misused.
This allow container to exceed the assigned memory limit and avoid
memcg OOM. Moreover, in case of global memory shortage on the host,
the OOM-killer may not find a real memory eater and start killing
random processes on the host.

This is especially important for OpenVZ and LXC used on hosting,
where containers are used by untrusted end users.

Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
4cpu VM with Fedora and self-complied upstream kernel. The calculations
are not precise, it depends on kernel config options, number of cpus,
enabled controllers, ignores possible page allocations etc.
However this is enough to clarify the general situation.
All allocations are splitted into:
- common part, always called for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/Vms
- percpu allocations, important for 'big irons'

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu
cpu:		~2.5Kb	+ 1036 bytes percpu
cpuset:		~3Kb	+   12 bytes percpu
blkcg:		~3Kb	+   12 bytes percpu
pid:		~1.5Kb	+   12 bytes percpu		
perf:		 ~320b	+   60 bytes percpu
-------------------------------------------
total:		~38Kb	+ 6142 bytes percpu
currently accounted:	  4668 bytes percpu

- it's important to account usual allocations called
in common part, because almost all of cgroup-specific allocations
are small. One exception here is memory cgroup, it allocates a few
huge objects that should be accounted.
- Percpu allocation called in common part, in memcg and cpu cgroups
should be accounted, rest ones are small an can be ignored.
- KERNFS objects are allocated both in common part and in most of
cgroups 

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked other cgroups types was found that they all can be ignored.
Additionally I found allocation of struct rt_rq called in cpu cgroup 
if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
percpu structure and should be accounted too.

v5:
 1) re-based to linux-mm (mm-everything-2022-06-22-20-36)

v4:
 1) re-based to linux-next (next-20220610)
   now psi_group is not a part of struct cgroup and is allocated on demand
 2) added received approval from Muchun Song
 3) improved cover letter description according to akpm@ request

v3:
 1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
 2) fixed few typos
 3) added received approvals

v2:
 1) re-split to simplify possible bisect, re-ordered
 2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
     allocated in common part
 3) added accounting for percpu allocation of struct rt_rq
     (actual if CONFIG_RT_GROUP_SCHED is enabled)
 4) improved patches descriptions 

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for perpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 2 +-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 14 insertions(+), 11 deletions(-)

-- 
2.36.1
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Vasily Averin 3 years, 10 months ago
Dear Michal,
do you still have any concerns about this patch set?

Thank you,
	Vasily Averin

On 6/23/22 17:50, Vasily Averin wrote:
> In some cases, creating a cgroup allocates a noticeable amount of memory.
> This operation can be executed from inside memory-limited container,
> but currently this memory is not accounted to memcg and can be misused.
> This allow container to exceed the assigned memory limit and avoid
> memcg OOM. Moreover, in case of global memory shortage on the host,
> the OOM-killer may not find a real memory eater and start killing
> random processes on the host.
> 
> This is especially important for OpenVZ and LXC used on hosting,
> where containers are used by untrusted end users.
> 
> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> are not precise, it depends on kernel config options, number of cpus,
> enabled controllers, ignores possible page allocations etc.
> However this is enough to clarify the general situation.
> All allocations are splitted into:
> - common part, always called for each cgroup type
> - per-cgroup allocations
> 
> In each group we consider 2 corner cases:
> - usual allocations, important for 1-2 CPU nodes/Vms
> - percpu allocations, important for 'big irons'
> 
> common part: 	~11Kb	+  318 bytes percpu
> memcg: 		~17Kb	+ 4692 bytes percpu
> cpu:		~2.5Kb	+ 1036 bytes percpu
> cpuset:		~3Kb	+   12 bytes percpu
> blkcg:		~3Kb	+   12 bytes percpu
> pid:		~1.5Kb	+   12 bytes percpu		
> perf:		 ~320b	+   60 bytes percpu
> -------------------------------------------
> total:		~38Kb	+ 6142 bytes percpu
> currently accounted:	  4668 bytes percpu
> 
> - it's important to account usual allocations called
> in common part, because almost all of cgroup-specific allocations
> are small. One exception here is memory cgroup, it allocates a few
> huge objects that should be accounted.
> - Percpu allocation called in common part, in memcg and cpu cgroups
> should be accounted, rest ones are small an can be ignored.
> - KERNFS objects are allocated both in common part and in most of
> cgroups 
> 
> Details can be found here:
> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> 
> I checked other cgroups types was found that they all can be ignored.
> Additionally I found allocation of struct rt_rq called in cpu cgroup 
> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> percpu structure and should be accounted too.
> 
> v5:
>  1) re-based to linux-mm (mm-everything-2022-06-22-20-36)
> 
> v4:
>  1) re-based to linux-next (next-20220610)
>    now psi_group is not a part of struct cgroup and is allocated on demand
>  2) added received approval from Muchun Song
>  3) improved cover letter description according to akpm@ request
> 
> v3:
>  1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
>  2) fixed few typos
>  3) added received approvals
> 
> v2:
>  1) re-split to simplify possible bisect, re-ordered
>  2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
>      allocated in common part
>  3) added accounting for percpu allocation of struct rt_rq
>      (actual if CONFIG_RT_GROUP_SCHED is enabled)
>  4) improved patches descriptions 
> 
> Vasily Averin (9):
>   memcg: enable accounting for struct cgroup
>   memcg: enable accounting for kernfs nodes
>   memcg: enable accounting for kernfs iattrs
>   memcg: enable accounting for struct simple_xattr
>   memcg: enable accounting for percpu allocation of struct psi_group_cpu
>   memcg: enable accounting for percpu allocation of struct
>     cgroup_rstat_cpu
>   memcg: enable accounting for large allocations in mem_cgroup_css_alloc
>   memcg: enable accounting for allocations in alloc_fair_sched_group
>   memcg: enable accounting for perpu allocation of struct rt_rq
> 
>  fs/kernfs/mount.c      | 6 ++++--
>  fs/xattr.c             | 2 +-
>  kernel/cgroup/cgroup.c | 2 +-
>  kernel/cgroup/rstat.c  | 3 ++-
>  kernel/sched/fair.c    | 4 ++--
>  kernel/sched/psi.c     | 2 +-
>  kernel/sched/rt.c      | 2 +-
>  mm/memcontrol.c        | 4 ++--
>  8 files changed, 14 insertions(+), 11 deletions(-)
>
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Michal Hocko 3 years, 10 months ago
On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> Dear Michal,
> do you still have any concerns about this patch set?

Yes, I do not think we have concluded this to be really necessary. IIRC 
Roman would like to see lingering cgroups addressed in not-so-distant
future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
have a limit for the number of cgroups in the tree. So why should we
chase after allocations that correspond the cgroups and somehow try to
cap their number via the memory consumption. This looks like something
that will get out of sync eventually and it also doesn't seem like the
best control to me (comparing to an explicit limit to prevent runaways).
-- 
Michal Hocko
SUSE Labs
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Shakeel Butt 3 years, 10 months ago
On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > Dear Michal,
> > do you still have any concerns about this patch set?
>
> Yes, I do not think we have concluded this to be really necessary. IIRC
> Roman would like to see lingering cgroups addressed in not-so-distant
> future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> have a limit for the number of cgroups in the tree. So why should we
> chase after allocations that correspond the cgroups and somehow try to
> cap their number via the memory consumption. This looks like something
> that will get out of sync eventually and it also doesn't seem like the
> best control to me (comparing to an explicit limit to prevent runaways).
> --

Let me give a counter argument to that. On a system running multiple
workloads, how can the admin come up with a sensible limit for the
number of cgroups? There will definitely be jobs that require much
more number of sub-cgroups. Asking the admins to dynamically tune
another tuneable is just asking for more complications. At the end all
the users would just set it to max.

I would recommend to see the commit ac7b79fd190b ("inotify, memcg:
account inotify instances to kmemcg") where there is already a sysctl
(inotify/max_user_instances) to limit the number of instances but
there was no sensible way to set that limit on a multi-tenant system.
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Michal Hocko 3 years, 9 months ago
On Thu 23-06-22 09:55:33, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > > Dear Michal,
> > > do you still have any concerns about this patch set?
> >
> > Yes, I do not think we have concluded this to be really necessary. IIRC
> > Roman would like to see lingering cgroups addressed in not-so-distant
> > future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> > have a limit for the number of cgroups in the tree. So why should we
> > chase after allocations that correspond the cgroups and somehow try to
> > cap their number via the memory consumption. This looks like something
> > that will get out of sync eventually and it also doesn't seem like the
> > best control to me (comparing to an explicit limit to prevent runaways).
> > --
> 
> Let me give a counter argument to that. On a system running multiple
> workloads, how can the admin come up with a sensible limit for the
> number of cgroups?

How is that any easier through memory consumption? Something that might
change between kernel versions? Is it even possible to prevent from id
depletion by the memory consumption? Any medium sized memcg can easily
consume all the ids AFAICS.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Shakeel Butt 3 years, 9 months ago
On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 09:55:33, Shakeel Butt wrote:
> > On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > > > Dear Michal,
> > > > do you still have any concerns about this patch set?
> > >
> > > Yes, I do not think we have concluded this to be really necessary. IIRC
> > > Roman would like to see lingering cgroups addressed in not-so-distant
> > > future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> > > have a limit for the number of cgroups in the tree. So why should we
> > > chase after allocations that correspond the cgroups and somehow try to
> > > cap their number via the memory consumption. This looks like something
> > > that will get out of sync eventually and it also doesn't seem like the
> > > best control to me (comparing to an explicit limit to prevent runaways).
> > > --
> >
> > Let me give a counter argument to that. On a system running multiple
> > workloads, how can the admin come up with a sensible limit for the
> > number of cgroups?
>
> How is that any easier through memory consumption? Something that might
> change between kernel versions?

In v2, we do provide a way for admins to right size the containers
without killing them. Actually we are trying to use memory.high for
right sizing the jobs. (It is not the best but workable and there are
opportunities to improve it).

Similar mechanisms for other types of limits are lacking. Usually the
application would be getting the error for which it can not do
anything most of the time.

> Is it even possible to prevent from id
> depletion by the memory consumption? Any medium sized memcg can easily
> consume all the ids AFAICS.

Though the patch series is pitched as protection against OOMs, I think
it is beneficial irrespective. Protection against an adversarial actor
should not be the aim here. IMO this patch series improves the memory
association to the actual user which is better than unattributed
memory treated as system overhead.
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Michal Hocko 3 years, 9 months ago
On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Is it even possible to prevent from id
> > depletion by the memory consumption? Any medium sized memcg can easily
> > consume all the ids AFAICS.
> 
> Though the patch series is pitched as protection against OOMs, I think
> it is beneficial irrespective. Protection against an adversarial actor
> should not be the aim here. IMO this patch series improves the memory
> association to the actual user which is better than unattributed
> memory treated as system overhead.

Considering the amount of memory and "normal" cgroup usage (I guess we
can agree that delegated subtrees do not count their cgroups in
thousands) is this really something that is worth bothering with?

I mean, these patches are really small and not really disruptive so I do
not really see any problem with them. Except that they clearly add a
maintenance overhead. Not directly with the memory they track but any
future cgroup/memcg metadata related objects would need to be tracked as
well and I am worried this will get quickly out of sync. So we will have
a half assed solution in place that doesn't really help any containment
nor it provides a good and robust consumption tracking.

All that being said I find these changes rather without a great value or
use.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Vasily Averin 3 years, 9 months ago
On 7/1/22 14:03, Michal Hocko wrote:
> On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
>> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
>>> Is it even possible to prevent from id
>>> depletion by the memory consumption? Any medium sized memcg can easily
>>> consume all the ids AFAICS.
>>
>> Though the patch series is pitched as protection against OOMs, I think
>> it is beneficial irrespective. Protection against an adversarial actor
>> should not be the aim here. IMO this patch series improves the memory
>> association to the actual user which is better than unattributed
>> memory treated as system overhead.
> 
> Considering the amount of memory and "normal" cgroup usage (I guess we
> can agree that delegated subtrees do not count their cgroups in
> thousands) is this really something that is worth bothering with?
> 
> I mean, these patches are really small and not really disruptive so I do
> not really see any problem with them. Except that they clearly add a
> maintenance overhead. Not directly with the memory they track but any
> future cgroup/memcg metadata related objects would need to be tracked as
> well and I am worried this will get quickly out of sync. So we will have
> a half assed solution in place that doesn't really help any containment
> nor it provides a good and robust consumption tracking.
> 
> All that being said I find these changes rather without a great value or
> use.

Dear Michal,
I sill have 2 questions:
1) if you do not want to account any memory allocated for cgroup objects,
should you perhaps revert commit 3e38e0aaca9e "mm: memcg: charge memcg percpu
memory to the parent cgroup". Is it an exception perhaps?
(in fact I hope you will not revert this patch, I just would like to know 
your explanations about this accounting)
2) my patch set includes kernfs accounting required for proper netdevices accounting

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

 2/9] memcg: enable accounting for kernfs nodes
 3/9] memcg: enable accounting for kernfs iattrs
 4/9] memcg: enable accounting for struct simple_xattr

What do you think about them? Should I resend them as a new separate patch set?

Thank you,
	Vasily Averin
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Michal Hocko 3 years, 9 months ago
On Sun 10-07-22 21:53:34, Vasily Averin wrote:
> On 7/1/22 14:03, Michal Hocko wrote:
> > On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
> >> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >>> Is it even possible to prevent from id
> >>> depletion by the memory consumption? Any medium sized memcg can easily
> >>> consume all the ids AFAICS.
> >>
> >> Though the patch series is pitched as protection against OOMs, I think
> >> it is beneficial irrespective. Protection against an adversarial actor
> >> should not be the aim here. IMO this patch series improves the memory
> >> association to the actual user which is better than unattributed
> >> memory treated as system overhead.
> > 
> > Considering the amount of memory and "normal" cgroup usage (I guess we
> > can agree that delegated subtrees do not count their cgroups in
> > thousands) is this really something that is worth bothering with?
> > 
> > I mean, these patches are really small and not really disruptive so I do
> > not really see any problem with them. Except that they clearly add a
> > maintenance overhead. Not directly with the memory they track but any
> > future cgroup/memcg metadata related objects would need to be tracked as
> > well and I am worried this will get quickly out of sync. So we will have
> > a half assed solution in place that doesn't really help any containment
> > nor it provides a good and robust consumption tracking.
> > 
> > All that being said I find these changes rather without a great value or
> > use.
> 
> Dear Michal,
> I sill have 2 questions:
> 1) if you do not want to account any memory allocated for cgroup objects,
> should you perhaps revert commit 3e38e0aaca9e "mm: memcg: charge memcg percpu
> memory to the parent cgroup". Is it an exception perhaps?
> (in fact I hope you will not revert this patch, I just would like to know 
> your explanations about this accounting)

Well, I have to say I was not a great fan of this patch when it was
proposed but I didn't really have strong arguments against it to nack
it. It was simple enough, rather self contained in few places. Just to
give you an insight into my thinking here. Your patchseries is also not
something I would nack (nor I have done that). I am not super fan of it
either. I voiced against it because it just hit my internal thrashold of
how many different places are patched without any systemic approach. If
we consider that it doesn't really help with the initial intention to
protect against adversaries then what is the point of all the churn?

Others might think differently and if you can get acks by other
maintainers then I won't stand in the way. I have voiced my concerns and
I hope my thinking is clear now.

> 2) my patch set includes kernfs accounting required for proper netdevices accounting
> 
> Allocs  Alloc   Allocation
> number  size
> --------------------------------------------
> 1   +  128      (__kernfs_new_node+0x4d)	kernfs node
> 1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
> 1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
> 1       32      (simple_xattr_set+0x59)
> 1       8       (__kernfs_new_node+0x30)
> 
>  2/9] memcg: enable accounting for kernfs nodes
>  3/9] memcg: enable accounting for kernfs iattrs
>  4/9] memcg: enable accounting for struct simple_xattr
> 
> What do you think about them? Should I resend them as a new separate patch set?

kernfs is not really my area so I cannot really comment on those.

-- 
Michal Hocko
SUSE Labs
[PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
Posted by Vasily Averin 3 years, 9 months ago
Currently host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot
start a new service, but nothing points to the real cause of
this failure.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..5229321636f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
 	if (memcg->id.id < 0) {
 		error = memcg->id.id;
+		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
 		goto fail;
 	}
 
-- 
2.36.1
Re: [PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
Posted by Roman Gushchin 3 years, 9 months ago
On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
> Currently host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot
> start a new service, but nothing points to the real cause of
> this failure.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..5229321636f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>  	if (memcg->id.id < 0) {
>  		error = memcg->id.id;
> +		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>  		goto fail;
>  	}

Hm, in this case it should return -ENOSPC and it's a very unique return code.
If it's not returned from the mkdir() call, we should fix this.
Otherwise it's up to systemd to handle it properly.

I'm not opposing for adding a warning, but parsing dmesg is not how
the error handling should be done.

Thanks!
Re: [PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
Posted by Vasily Averin 3 years, 9 months ago
On 6/26/22 04:56, Roman Gushchin wrote:
> On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
>> Currently host owner is not informed about the exhaustion of the
>> global mem_cgroup_id space. When this happens, systemd cannot
>> start a new service, but nothing points to the real cause of
>> this failure.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>> ---
>>  mm/memcontrol.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index d4c606a06bcd..5229321636f2 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>>  				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>>  	if (memcg->id.id < 0) {
>>  		error = memcg->id.id;
>> +		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>>  		goto fail;
>>  	}
> 
> Hm, in this case it should return -ENOSPC and it's a very unique return code.
> If it's not returned from the mkdir() call, we should fix this.
> Otherwise it's up to systemd to handle it properly.
> 
> I'm not opposing for adding a warning, but parsing dmesg is not how
> the error handling should be done.

I'm agree,  I think it's a good idea. Moreover I think it makes sense to
use -ENOSPC  when the local cgroup's limit is reached.
Currently cgroup_mkdir() returns -EAGAIN, this looks strange for me.

        if (!cgroup_check_hierarchy_limits(parent)) {
                ret = -EAGAIN;
                goto out_unlock;
        }

Thank you,
	Vasily Averin
[PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Vasily Averin 3 years, 9 months ago
When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
not return -EAGAIN, but instead react similarly to reaching the global
limit.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1be0f81fe8e1..243239553ea3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 		return -ENODEV;
 
 	if (!cgroup_check_hierarchy_limits(parent)) {
-		ret = -EAGAIN;
+		ret = -ENOSPC;
 		goto out_unlock;
 	}
 
-- 
2.36.1
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Roman Gushchin 3 years, 9 months ago
On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  kernel/cgroup/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1be0f81fe8e1..243239553ea3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>  		return -ENODEV;
>  
>  	if (!cgroup_check_hierarchy_limits(parent)) {
> -		ret = -EAGAIN;
> +		ret = -ENOSPC;

I'd not argue whether ENOSPC is better or worse here, but I don't think we need
to change it now. It's been in this state for a long time and is a part of ABI.
EAGAIN is pretty unique as a mkdir() result, so systemd can handle it well.

Thanks!
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Vasily Averin 3 years, 9 months ago
On 6/28/22 03:44, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
>> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
>> not return -EAGAIN, but instead react similarly to reaching the global
>> limit.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>> ---
>>  kernel/cgroup/cgroup.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1be0f81fe8e1..243239553ea3 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>>  		return -ENODEV;
>>  
>>  	if (!cgroup_check_hierarchy_limits(parent)) {
>> -		ret = -EAGAIN;
>> +		ret = -ENOSPC;
> 
> I'd not argue whether ENOSPC is better or worse here, but I don't think we need
> to change it now. It's been in this state for a long time and is a part of ABI.
> EAGAIN is pretty unique as a mkdir() result, so systemd can handle it well.

I would agree with you, however in my opinion EAGAIN is used to restart an
interrupted system call. Thus, I worry its return can loop the user space without
any chance of continuation.

However, maybe I'm confusing something?

Thank you,
	Vasily Averin
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Michal Koutný 3 years, 9 months ago
On Tue, Jun 28, 2022 at 06:59:06AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> I would agree with you, however in my opinion EAGAIN is used to restart an
> interrupted system call. Thus, I worry its return can loop the user space without
> any chance of continuation.
> 
> However, maybe I'm confusing something?

The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better
sense here. (And I suspect the dependency on this particular value won't
be very wide spread.)

0.02€
Michal
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Tejun Heo 3 years, 9 months ago
On Tue, Jun 28, 2022 at 11:16:48AM +0200, Michal Koutný wrote:
> The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better
> sense here. (And I suspect the dependency on this particular value won't
> be very wide spread.)

Given how we use these system calls as triggers for random kernel
operations, I don't think adhering to posix standard is necessary or
possible. Using an error code which isn't listed in the man page isn't
particularly high in the list of discrepancies.

Again, I'm not against changing it but I'd like to see better
rationales. On one side, we have "it's been this way for a long time
and there's nothing particularly broken about it". I'm not sure the
arguments we have for the other side is strong enough yet.

Thanks.

-- 
tejun
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Vasily Averin 3 years, 9 months ago
On 6/28/22 12:22, Tejun Heo wrote:
> On Tue, Jun 28, 2022 at 11:16:48AM +0200, Michal Koutný wrote:
>> The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better
>> sense here. (And I suspect the dependency on this particular value won't
>> be very wide spread.)
> 
> Given how we use these system calls as triggers for random kernel
> operations, I don't think adhering to posix standard is necessary or
> possible. Using an error code which isn't listed in the man page isn't
> particularly high in the list of discrepancies.
> 
> Again, I'm not against changing it but I'd like to see better
> rationales. On one side, we have "it's been this way for a long time
> and there's nothing particularly broken about it". I'm not sure the
> arguments we have for the other side is strong enough yet.

I would like to recall this patch.

I experimented on fedora36 node with LXC and centos stream 9 container.
and I did not noticed any critical systemd troubles with original -EAGAIN.
When cgroup's limit is reached systemd cannot start new services, 
for example lxc-attach generates following output:

[root@fc34-vvs ~]# lxc-attach c9s
lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc"
lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11
lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup
lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container

I did not found any loop in userspace caused by EAGAIN.
Messages looks unclear, however situation with the patched kernel is not much better:

[root@fc34-vvs ~]# lxc-attach c9s
lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc"
lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11
lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup
lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container

Thank you,
	Vasily Averin
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Tejun Heo 3 years, 9 months ago
On Wed, Jun 29, 2022 at 09:13:02AM +0300, Vasily Averin wrote:
> I experimented on fedora36 node with LXC and centos stream 9 container.
> and I did not noticed any critical systemd troubles with original -EAGAIN.
> When cgroup's limit is reached systemd cannot start new services, 
> for example lxc-attach generates following output:
> 
> [root@fc34-vvs ~]# lxc-attach c9s
> lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc"
> lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11
> lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup
> lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container
> 
> I did not found any loop in userspace caused by EAGAIN.
> Messages looks unclear, however situation with the patched kernel is not much better:
> 
> [root@fc34-vvs ~]# lxc-attach c9s
> lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc"
> lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11
> lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup
> lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container

I'd say "resource temporarily unavailable" is better fitting than "no
space left on device" and the syscall restart thing isn't handled by
-EAGAIN return value. Grep restart_block for that.

Thanks.

-- 
tejun
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Roman Gushchin 3 years, 9 months ago
On Thu, Jun 30, 2022 at 04:25:57AM +0900, Tejun Heo wrote:
> On Wed, Jun 29, 2022 at 09:13:02AM +0300, Vasily Averin wrote:
> > I experimented on fedora36 node with LXC and centos stream 9 container.
> > and I did not noticed any critical systemd troubles with original -EAGAIN.
> > When cgroup's limit is reached systemd cannot start new services, 
> > for example lxc-attach generates following output:
> > 
> > [root@fc34-vvs ~]# lxc-attach c9s
> > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc"
> > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11
> > lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup
> > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container
> > 
> > I did not found any loop in userspace caused by EAGAIN.
> > Messages looks unclear, however situation with the patched kernel is not much better:
> > 
> > [root@fc34-vvs ~]# lxc-attach c9s
> > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc"
> > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11
> > lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup
> > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container
> 
> I'd say "resource temporarily unavailable" is better fitting than "no
> space left on device"

+1

Thanks!
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Tejun Heo 3 years, 9 months ago
On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.

While I'm not necessarily against this change, I find the rationale to
be somewhat lacking. Can you please elaborate why -ENOSPC is the right
one while -EAGAIN is incorrect?

Thanks.

-- 
tejun
Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
Posted by Muchun Song 3 years, 9 months ago
On Mon, Jun 27, 2022 at 10:12 AM Vasily Averin <vvs@openvz.org> wrote:
>
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.
[PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
Posted by Vasily Averin 3 years, 9 months ago
Currently, the host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot start a
new service and receives a unique -ENOSPC error code.
However, this can happen inside this container, persist in the log file
of the local container, and may not be noticed by the host owner if he
did not try to start any new services.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: Roman Gushchin pointed that idr_alloc() should return unique -ENOSPC
    if no free IDs could be found, but can also return -ENOMEM.
    Therefore error code check was added before message output and
    patch descriprion was adopted.
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..ffc6b5d6b95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
 	if (memcg->id.id < 0) {
 		error = memcg->id.id;
+		if (error == -ENOSPC)
+			pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
 		goto fail;
 	}
 
-- 
2.36.1
Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
Posted by Muchun Song 3 years, 9 months ago
On Mon, Jun 27, 2022 at 10:11 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Currently, the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot start a
> new service and receives a unique -ENOSPC error code.
> However, this can happen inside this container, persist in the log file
> of the local container, and may not be noticed by the host owner if he
> did not try to start any new services.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: Roman Gushchin pointed that idr_alloc() should return unique -ENOSPC

If the caller can know -ENOSPC is returned by mkdir(), then I
think the user (perhaps systemd) is the best place to throw out the
error message instead of in the kernel log. Right?

Thanks.

>     if no free IDs could be found, but can also return -ENOMEM.
>     Therefore error code check was added before message output and
>     patch descriprion was adopted.
> ---
>  mm/memcontrol.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..ffc6b5d6b95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                                  1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>         if (memcg->id.id < 0) {
>                 error = memcg->id.id;
> +               if (error == -ENOSPC)
> +                       pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>                 goto fail;
>         }
>
> --
> 2.36.1
>
Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
Posted by Vasily Averin 3 years, 9 months ago
On 6/27/22 06:23, Muchun Song wrote:
> If the caller can know -ENOSPC is returned by mkdir(), then I
> think the user (perhaps systemd) is the best place to throw out the
> error message instead of in the kernel log. Right?

Such an incident may occur inside the container.
OpenVZ nodes can host 300-400 containers, and the host admin cannot
monitor guest logs. the dmesg message is necessary to inform the host
owner that the global limit has been reached, otherwise he can
continue to believe that there are no problems on the node.

Thank you,
	Vasily Averin
Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
Posted by Roman Gushchin 3 years, 9 months ago
On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
> On 6/27/22 06:23, Muchun Song wrote:
> > If the caller can know -ENOSPC is returned by mkdir(), then I
> > think the user (perhaps systemd) is the best place to throw out the
> > error message instead of in the kernel log. Right?
> 
> Such an incident may occur inside the container.
> OpenVZ nodes can host 300-400 containers, and the host admin cannot
> monitor guest logs. the dmesg message is necessary to inform the host
> owner that the global limit has been reached, otherwise he can
> continue to believe that there are no problems on the node.

Why this is happening? It's hard to believe someone really needs that
many cgroups. Is this when somebody fails to delete old cgroups?

I wanted to say that it's better to introduce a memcg event, but then
I realized it's probably not worth the wasted space. Is this a common
scenario?

I think a better approach will be to add a cgroup event (displayed via
cgroup.events) about reaching the maximum limit of cgroups. E.g.
cgroups.events::max_nr_reached. Then you can set cgroup.max.descendants
to some value below memcg_id space size. It's more work, but IMO it's
a better way to communicate this event. As a bonus, you can easily
get an idea which cgroup depletes the limit.

Thanks!
Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
Posted by Michal Koutný 3 years, 9 months ago
On Mon, Jun 27, 2022 at 06:11:27PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups. E.g.
> cgroups.events::max_nr_reached.

This sounds like a good generalization.

> Then you can set cgroup.max.descendants to some value below memcg_id
> space size. It's more work, but IMO it's a better way to communicate
> this event. As a bonus, you can easily get an idea which cgroup
> depletes the limit.

Just mind there's a difference between events: what cgroup's limit was
hit and what cgroup was affected by the limit [1] (the former is more
useful for the calibration if I understand the situation).

Michal

[1] https://lore.kernel.org/all/20200205134426.10570-2-mkoutny@suse.com/
Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
Posted by Vasily Averin 3 years, 9 months ago
On 6/28/22 04:11, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
>> On 6/27/22 06:23, Muchun Song wrote:
>>> If the caller can know -ENOSPC is returned by mkdir(), then I
>>> think the user (perhaps systemd) is the best place to throw out the
>>> error message instead of in the kernel log. Right?
>>
>> Such an incident may occur inside the container.
>> OpenVZ nodes can host 300-400 containers, and the host admin cannot
>> monitor guest logs. the dmesg message is necessary to inform the host
>> owner that the global limit has been reached, otherwise he can
>> continue to believe that there are no problems on the node.
> 
> Why this is happening? It's hard to believe someone really needs that
> many cgroups. Is this when somebody fails to delete old cgroups?

I do not have direct claims that some node really reached this limit,
however I saw crashdumps with 30000+ cgroups.

Theoretically OpenVz/LXC nodes can host up to several thousand containers
per node. Practically production nodes with 300-400 containers are a
common thing. I assume that each container can easily use up to 100-200
memory cgroups, and I think this is normal consumption.

Therefore, I believe that 64K limit is quite achievable in real life.
Primary goal of my patch is to confirm this theory.

> I wanted to say that it's better to introduce a memcg event, but then
> I realized it's probably not worth the wasted space. Is this a common
> scenario?
> 
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups. E.g.
> cgroups.events::max_nr_reached. Then you can set cgroup.max.descendants
> to some value below memcg_id space size. It's more work, but IMO it's
> a better way to communicate this event. As a bonus, you can easily
> get an idea which cgroup depletes the limit.

For my goal (i.e. just to confirm that 64K limit was reached) this
functionality is too complicated. This confirmation is important because
it should push us to increase the global limit.

However, I think your idea is great, In perspective it helps both OpenVZ
and LXC and possibly Shakeel to understand the real memcg using and set
the proper limit for containers. 

I'm going to prepare such patches, however I'm not sure I'll have enough
time for this task in the near future.

Thank you,
	Vasily Averin
[PATCH RFC] memcg: avoid idr ids space depletion
Posted by Vasily Averin 3 years, 9 months ago
I tried to increase MEM_CGROUP_ID_MAX to INT_MAX and found no
significant difficulties. What do you think about following patch?
I did not tested it, just checked its compilation.
I hope it allows:
- to avoid memcg id space depletion on normal nodes
- to set up per-container cgroup limit to USHRT_MAX to prevent possible misuse
  and in general use memcg accounting for allocated resources.

Thank you,
	Vasily Averin
---

Michal Hocko pointed that memory controller depends on idr ids which
have a space that is rather limited
 #define MEM_CGROUP_ID_MAX       USHRT_MAX

The limit can be reached on nodes hosted several hundred OS containers
with new distributions running hundreds of services in their own memory
cgroups.

This patch increases the space up to INT_MAX.
---
 include/linux/memcontrol.h  | 15 +++++++++------
 include/linux/swap_cgroup.h | 14 +++++---------
 mm/memcontrol.c             |  6 +++---
 mm/swap_cgroup.c            | 10 ++++------
 4 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 744cde2b2368..e3468550ba20 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -59,10 +59,13 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-
+#ifdef CONFIG_64BIT
+#define MEM_CGROUP_ID_SHIFT	31
+#define MEM_CGROUP_ID_MAX	INT_MAX - 1
+#else
 #define MEM_CGROUP_ID_SHIFT	16
 #define MEM_CGROUP_ID_MAX	USHRT_MAX
-
+#endif
 struct mem_cgroup_id {
 	int id;
 	refcount_t ref;
@@ -852,14 +855,14 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+static inline int mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return 0;
 
 	return memcg->id.id;
 }
-struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
+struct mem_cgroup *mem_cgroup_from_id(int id);
 
 #ifdef CONFIG_SHRINKER_DEBUG
 static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
@@ -1374,12 +1377,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+static inline int mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
 }
 
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+static inline struct mem_cgroup *mem_cgroup_from_id(int id)
 {
 	WARN_ON_ONCE(id);
 	/* XXX: This should always return root_mem_cgroup */
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index a12dd1c3966c..711dd18380ed 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,25 +6,21 @@
 
 #ifdef CONFIG_MEMCG_SWAP
 
-extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-					 unsigned int nr_ents);
-extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern int swap_cgroup_cmpxchg(swp_entry_t ent, int old, int new);
+extern int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents);
+extern int lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+unsigned short swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents)
 {
 	return 0;
 }
 
-static inline
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
+static inline int lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 275d0c847f05..d4c606a06bcd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5224,7 +5224,7 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
  *
  * Caller must hold rcu_read_lock().
  */
-struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+struct mem_cgroup *mem_cgroup_from_id(int id)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	return idr_find(&mem_cgroup_idr, id);
@@ -7021,7 +7021,7 @@ int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
 {
 	struct folio *folio = page_folio(page);
 	struct mem_cgroup *memcg;
-	unsigned short id;
+	int id;
 	int ret;
 
 	if (mem_cgroup_disabled())
@@ -7541,7 +7541,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
-	unsigned short id;
+	int id;
 
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 5a9442979a18..76fa5c42e03f 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -15,7 +15,7 @@ struct swap_cgroup_ctrl {
 static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
 
 struct swap_cgroup {
-	unsigned short		id;
+	int		id;
 };
 #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
 
@@ -94,8 +94,7 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
  * Returns old id at success, 0 at failure.
  * (There is no mem_cgroup using 0 as its id)
  */
-unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new)
+int swap_cgroup_cmpxchg(swp_entry_t ent, int old, int new)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
@@ -123,8 +122,7 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
@@ -159,7 +157,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
  *
  * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
  */
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
+int lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
-- 
2.36.1
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Vasily Averin 3 years, 9 months ago
On 6/23/22 19:55, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 23-06-22 18:03:31, Vasily Averin wrote:
>>> Dear Michal,
>>> do you still have any concerns about this patch set?
>>
>> Yes, I do not think we have concluded this to be really necessary. IIRC
>> Roman would like to see lingering cgroups addressed in not-so-distant
>> future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
>> have a limit for the number of cgroups in the tree. So why should we
>> chase after allocations that correspond the cgroups and somehow try to
>> cap their number via the memory consumption. This looks like something
>> that will get out of sync eventually and it also doesn't seem like the
>> best control to me (comparing to an explicit limit to prevent runaways).
>> --
> 
> Let me give a counter argument to that. On a system running multiple
> workloads, how can the admin come up with a sensible limit for the
> number of cgroups? There will definitely be jobs that require much
> more number of sub-cgroups. Asking the admins to dynamically tune
> another tuneable is just asking for more complications. At the end all
> the users would just set it to max.
> 
> I would recommend to see the commit ac7b79fd190b ("inotify, memcg:
> account inotify instances to kmemcg") where there is already a sysctl
> (inotify/max_user_instances) to limit the number of instances but
> there was no sensible way to set that limit on a multi-tenant system.

I've found that MEM_CGROUP_ID_MAX limits memory cgroups only. Other types
of cgroups do not have similar restrictions. Yes, we can set some per-container 
limit for all cgroups, but to me it looks like workaround while
proper memory accounting looks like real solution.

Btw could you please explain why memory cgroups have MEM_CGROUP_ID_MAX limit
Why it is required at all and why it was set to USHRT_MAX? I believe that
in the future it may be really reachable:

Let's set up per-container cgroup limit to some small numbers, 
for example to 512 as OpenVz doing right now. On real node with 300
containers we can easily get 100*300 = 30000 cgroups, and consume ~3Gb memory, 
without any misuse. I think it is too much to ignore its accounting.

Thank you,
	Vasily Averin
Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
Posted by Michal Koutný 3 years, 9 months ago
On Fri, Jun 24, 2022 at 01:40:14PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> Btw could you please explain why memory cgroups have MEM_CGROUP_ID_MAX limit
> Why it is required at all and why it was set to USHRT_MAX? I believe that
> in the future it may be really reachable:

IIRC, one reason is 2B * nr_swap_pages of memory overhead (in
swap_cgroup_swapon()) that's ~0.05% of swap space occupied additionally
in RAM (fortunately swap needn't cover whole RAM).

HTH,
Michal