Currently, the host owner is not informed when the global mem_cgroup_id
space is exhausted. When this happens, systemd cannot start a new
service and receives a unique -ENOSPC error code.
However, this can happen inside a container, the error then only
persists in the local container's log file, and the host owner may not
notice it unless he tries to start a new service himself.
Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: Roman Gushchin pointed out that idr_alloc() should return a unique
    -ENOSPC if no free IDs could be found, but can also return -ENOMEM.
    Therefore an error code check was added before the message output and
    the patch description was adjusted.
---
mm/memcontrol.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..ffc6b5d6b95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
if (memcg->id.id < 0) {
error = memcg->id.id;
+ if (error == -ENOSPC)
+ pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
goto fail;
}
--
2.36.1
On Mon, Jun 27, 2022 at 10:11 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Currently, the host owner is not informed when the global mem_cgroup_id
> space is exhausted. When this happens, systemd cannot start a new
> service and receives a unique -ENOSPC error code.
> However, this can happen inside a container, the error then only
> persists in the local container's log file, and the host owner may not
> notice it unless he tries to start a new service himself.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: Roman Gushchin pointed out that idr_alloc() should return a unique -ENOSPC
If the caller can tell that mkdir() returned -ENOSPC, then I think
userspace (perhaps systemd) is the better place to report the error,
instead of the kernel log. Right?
Thanks.
>     if no free IDs could be found, but can also return -ENOMEM.
>     Therefore an error code check was added before the message output and
>     the patch description was adjusted.
> ---
> mm/memcontrol.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..ffc6b5d6b95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
> if (memcg->id.id < 0) {
> error = memcg->id.id;
> + if (error == -ENOSPC)
> + pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
> goto fail;
> }
>
> --
> 2.36.1
>
On 6/27/22 06:23, Muchun Song wrote:
> If the caller can tell that mkdir() returned -ENOSPC, then I think
> userspace (perhaps systemd) is the better place to report the error,
> instead of the kernel log. Right?

Such an incident may occur inside a container. OpenVZ nodes can host
300-400 containers, and the host admin cannot monitor guest logs.
The dmesg message is necessary to inform the host owner that the global
limit has been reached; otherwise he may continue to believe that there
are no problems on the node.

Thank you,
	Vasily Averin
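[For concreteness, the container-side failure being discussed might look
like the minimal sketch below: the container's service manager creates a
memory cgroup directory and only it sees the -ENOSPC. The cgroup v2 mount
point and the child directory name are assumptions for illustration, not
taken from the thread.]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
	/* Hypothetical path: assumes cgroup v2 is mounted at /sys/fs/cgroup. */
	const char *path = "/sys/fs/cgroup/example-service";

	if (mkdir(path, 0755) < 0) {
		if (errno == ENOSPC)
			/* mem_cgroup_alloc() ran out of IDs; only this
			 * container's manager sees the error. */
			fprintf(stderr, "cgroup ID space exhausted\n");
		else
			fprintf(stderr, "mkdir %s: %s\n", path, strerror(errno));
		return 1;
	}
	return 0;
}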
On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
> On 6/27/22 06:23, Muchun Song wrote:
> > If the caller can tell that mkdir() returned -ENOSPC, then I think
> > userspace (perhaps systemd) is the better place to report the error,
> > instead of the kernel log. Right?
>
> Such an incident may occur inside a container. OpenVZ nodes can host
> 300-400 containers, and the host admin cannot monitor guest logs.
> The dmesg message is necessary to inform the host owner that the global
> limit has been reached; otherwise he may continue to believe that there
> are no problems on the node.

Why is this happening? It's hard to believe someone really needs that
many cgroups. Is this a case where somebody fails to delete old cgroups?

I wanted to say that it's better to introduce a memcg event, but then
I realized it's probably not worth the wasted space. Is this a common
scenario?

I think a better approach would be to add a cgroup event (displayed via
cgroup.events) about reaching the maximum limit of cgroups, e.g.
cgroup.events::max_nr_reached. Then you can set cgroup.max.descendants
to some value below the memcg_id space size. It's more work, but IMO
it's a better way to communicate this event. As a bonus, you can easily
get an idea of which cgroup depletes the limit.

Thanks!
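[A rough sketch of how a host-side manager might consume the interface
proposed above: cgroup.max.descendants is an existing cgroup v2 control
file, but the max_nr_reached key in cgroup.events does not exist yet and
is purely hypothetical here, as are the path and the limit value.]

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Illustrative path and limit; not part of the proposed patch. */
	const char *root = "/sys/fs/cgroup/machine.slice";
	char path[256], line[256];
	FILE *f;

	/* Cap the subtree well below the 64K mem_cgroup_id space. */
	snprintf(path, sizeof(path), "%s/cgroup.max.descendants", root);
	f = fopen(path, "w");
	if (f) {
		fputs("50000\n", f);
		fclose(f);
	}

	/* Check cgroup.events for the proposed (hypothetical) key. */
	snprintf(path, sizeof(path), "%s/cgroup.events", root);
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (strncmp(line, "max_nr_reached 1", 16) == 0)
			printf("cgroup limit reached under %s\n", root);
	fclose(f);
	return 0;
}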
On Mon, Jun 27, 2022 at 06:11:27PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> I think a better approach would be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups, e.g.
> cgroup.events::max_nr_reached.

This sounds like a good generalization.

> Then you can set cgroup.max.descendants to some value below the
> memcg_id space size. It's more work, but IMO it's a better way to
> communicate this event. As a bonus, you can easily get an idea of
> which cgroup depletes the limit.

Just mind that there is a difference between events: which cgroup's limit
was hit and which cgroup was affected by the limit [1] (the former is more
useful for the calibration, if I understand the situation).

Michal

[1] https://lore.kernel.org/all/20200205134426.10570-2-mkoutny@suse.com/
On 6/28/22 04:11, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
>> On 6/27/22 06:23, Muchun Song wrote:
>>> If the caller can tell that mkdir() returned -ENOSPC, then I think
>>> userspace (perhaps systemd) is the better place to report the error,
>>> instead of the kernel log. Right?
>>
>> Such an incident may occur inside a container. OpenVZ nodes can host
>> 300-400 containers, and the host admin cannot monitor guest logs.
>> The dmesg message is necessary to inform the host owner that the global
>> limit has been reached; otherwise he may continue to believe that there
>> are no problems on the node.
>
> Why is this happening? It's hard to believe someone really needs that
> many cgroups. Is this a case where somebody fails to delete old cgroups?

I do not have direct evidence that some node really reached this limit;
however, I have seen crash dumps with 30000+ cgroups.
Theoretically, OpenVz/LXC nodes can host up to several thousand containers
per node. In practice, production nodes with 300-400 containers are a
common thing. I assume that each container can easily use up to 100-200
memory cgroups, and I think this is normal consumption.
Therefore, I believe that the 64K limit is quite achievable in real life.
The primary goal of my patch is to confirm this theory.

> I wanted to say that it's better to introduce a memcg event, but then
> I realized it's probably not worth the wasted space. Is this a common
> scenario?
>
> I think a better approach would be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups, e.g.
> cgroup.events::max_nr_reached. Then you can set cgroup.max.descendants
> to some value below the memcg_id space size. It's more work, but IMO
> it's a better way to communicate this event. As a bonus, you can easily
> get an idea of which cgroup depletes the limit.

For my goal (i.e. just to confirm that the 64K limit was reached) this
functionality is too complicated. This confirmation is important because
it should push us to increase the global limit.
However, I think your idea is great: in perspective it helps both OpenVZ
and LXC, and possibly Shakeel, to understand real memcg usage and set
proper limits for containers.
I'm going to prepare such patches, however I'm not sure I'll have enough
time for this task in the near future.

Thank you,
	Vasily Averin