[v2] memcg: notify about global mem_cgroup_id space depletion

[PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion

Posted by Vasily Averin 3 years, 9 months ago

Currently, the host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot start a
new service and receives a unique -ENOSPC error code.
However, this can happen inside this container, persist in the log file
of the local container, and may not be noticed by the host owner if he
did not try to start any new services.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: Roman Gushchin pointed that idr_alloc() should return unique -ENOSPC
    if no free IDs could be found, but can also return -ENOMEM.
    Therefore error code check was added before message output and
    patch descriprion was adopted.
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..ffc6b5d6b95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
 	if (memcg->id.id < 0) {
 		error = memcg->id.id;
+		if (error == -ENOSPC)
+			pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
 		goto fail;
 	}
 
-- 
2.36.1

Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion

Posted by Muchun Song 3 years, 9 months ago

On Mon, Jun 27, 2022 at 10:11 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Currently, the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot start a
> new service and receives a unique -ENOSPC error code.
> However, this can happen inside this container, persist in the log file
> of the local container, and may not be noticed by the host owner if he
> did not try to start any new services.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: Roman Gushchin pointed that idr_alloc() should return unique -ENOSPC

If the caller can know -ENOSPC is returned by mkdir(), then I
think the user (perhaps systemd) is the best place to throw out the
error message instead of in the kernel log. Right?

Thanks.

>     if no free IDs could be found, but can also return -ENOMEM.
>     Therefore error code check was added before message output and
>     patch descriprion was adopted.
> ---
>  mm/memcontrol.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..ffc6b5d6b95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                                  1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>         if (memcg->id.id < 0) {
>                 error = memcg->id.id;
> +               if (error == -ENOSPC)
> +                       pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>                 goto fail;
>         }
>
> --
> 2.36.1
>

Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion

Posted by Vasily Averin 3 years, 9 months ago

On 6/27/22 06:23, Muchun Song wrote:
> If the caller can know -ENOSPC is returned by mkdir(), then I
> think the user (perhaps systemd) is the best place to throw out the
> error message instead of in the kernel log. Right?

Such an incident may occur inside the container.
OpenVZ nodes can host 300-400 containers, and the host admin cannot
monitor guest logs. the dmesg message is necessary to inform the host
owner that the global limit has been reached, otherwise he can
continue to believe that there are no problems on the node.

Thank you,
	Vasily Averin

Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion

Posted by Roman Gushchin 3 years, 9 months ago

On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
> On 6/27/22 06:23, Muchun Song wrote:
> > If the caller can know -ENOSPC is returned by mkdir(), then I
> > think the user (perhaps systemd) is the best place to throw out the
> > error message instead of in the kernel log. Right?
> 
> Such an incident may occur inside the container.
> OpenVZ nodes can host 300-400 containers, and the host admin cannot
> monitor guest logs. the dmesg message is necessary to inform the host
> owner that the global limit has been reached, otherwise he can
> continue to believe that there are no problems on the node.

Why this is happening? It's hard to believe someone really needs that
many cgroups. Is this when somebody fails to delete old cgroups?

I wanted to say that it's better to introduce a memcg event, but then
I realized it's probably not worth the wasted space. Is this a common
scenario?

I think a better approach will be to add a cgroup event (displayed via
cgroup.events) about reaching the maximum limit of cgroups. E.g.
cgroups.events::max_nr_reached. Then you can set cgroup.max.descendants
to some value below memcg_id space size. It's more work, but IMO it's
a better way to communicate this event. As a bonus, you can easily
get an idea which cgroup depletes the limit.

Thanks!

Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion

Posted by Michal Koutný 3 years, 9 months ago

On Mon, Jun 27, 2022 at 06:11:27PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups. E.g.
> cgroups.events::max_nr_reached.

This sounds like a good generalization.

> Then you can set cgroup.max.descendants to some value below memcg_id
> space size. It's more work, but IMO it's a better way to communicate
> this event. As a bonus, you can easily get an idea which cgroup
> depletes the limit.

Just mind there's a difference between events: what cgroup's limit was
hit and what cgroup was affected by the limit [1] (the former is more
useful for the calibration if I understand the situation).

Michal

[1] https://lore.kernel.org/all/20200205134426.10570-2-mkoutny@suse.com/

Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion

Posted by Vasily Averin 3 years, 9 months ago

On 6/28/22 04:11, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
>> On 6/27/22 06:23, Muchun Song wrote:
>>> If the caller can know -ENOSPC is returned by mkdir(), then I
>>> think the user (perhaps systemd) is the best place to throw out the
>>> error message instead of in the kernel log. Right?
>>
>> Such an incident may occur inside the container.
>> OpenVZ nodes can host 300-400 containers, and the host admin cannot
>> monitor guest logs. the dmesg message is necessary to inform the host
>> owner that the global limit has been reached, otherwise he can
>> continue to believe that there are no problems on the node.
> 
> Why this is happening? It's hard to believe someone really needs that
> many cgroups. Is this when somebody fails to delete old cgroups?

I do not have direct claims that some node really reached this limit,
however I saw crashdumps with 30000+ cgroups.

Theoretically OpenVz/LXC nodes can host up to several thousand containers
per node. Practically production nodes with 300-400 containers are a
common thing. I assume that each container can easily use up to 100-200
memory cgroups, and I think this is normal consumption.

Therefore, I believe that 64K limit is quite achievable in real life.
Primary goal of my patch is to confirm this theory.

> I wanted to say that it's better to introduce a memcg event, but then
> I realized it's probably not worth the wasted space. Is this a common
> scenario?
> 
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups. E.g.
> cgroups.events::max_nr_reached. Then you can set cgroup.max.descendants
> to some value below memcg_id space size. It's more work, but IMO it's
> a better way to communicate this event. As a bonus, you can easily
> get an idea which cgroup depletes the limit.

For my goal (i.e. just to confirm that 64K limit was reached) this
functionality is too complicated. This confirmation is important because
it should push us to increase the global limit.

However, I think your idea is great, In perspective it helps both OpenVZ
and LXC and possibly Shakeel to understand the real memcg using and set
the proper limit for containers. 

I'm going to prepare such patches, however I'm not sure I'll have enough
time for this task in the near future.

Thank you,
	Vasily Averin