mm/memcontrol.c | 1 + 1 file changed, 1 insertion(+)
Currently host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot
start a new service, but nothing points to the real cause of
this failure.
Signed-off-by: Vasily Averin <vvs@openvz.org>
---
mm/memcontrol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..5229321636f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
if (memcg->id.id < 0) {
error = memcg->id.id;
+ pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
goto fail;
}
--
2.36.1
On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
> Currently host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot
> start a new service, but nothing points to the real cause of
> this failure.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> mm/memcontrol.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..5229321636f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
> if (memcg->id.id < 0) {
> error = memcg->id.id;
> + pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
> goto fail;
> }
Hm, in this case it should return -ENOSPC and it's a very unique return code.
If it's not returned from the mkdir() call, we should fix this.
Otherwise it's up to systemd to handle it properly.
I'm not opposing for adding a warning, but parsing dmesg is not how
the error handling should be done.
Thanks!
On 6/26/22 04:56, Roman Gushchin wrote:
> On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
>> Currently host owner is not informed about the exhaustion of the
>> global mem_cgroup_id space. When this happens, systemd cannot
>> start a new service, but nothing points to the real cause of
>> this failure.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>> ---
>> mm/memcontrol.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index d4c606a06bcd..5229321636f2 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>> 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>> if (memcg->id.id < 0) {
>> error = memcg->id.id;
>> + pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>> goto fail;
>> }
>
> Hm, in this case it should return -ENOSPC and it's a very unique return code.
> If it's not returned from the mkdir() call, we should fix this.
> Otherwise it's up to systemd to handle it properly.
>
> I'm not opposing for adding a warning, but parsing dmesg is not how
> the error handling should be done.
I'm agree, I think it's a good idea. Moreover I think it makes sense to
use -ENOSPC when the local cgroup's limit is reached.
Currently cgroup_mkdir() returns -EAGAIN, this looks strange for me.
if (!cgroup_check_hierarchy_limits(parent)) {
ret = -EAGAIN;
goto out_unlock;
}
Thank you,
Vasily Averin
When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
not return -EAGAIN, but instead react similarly to reaching the global
limit.
Signed-off-by: Vasily Averin <vvs@openvz.org>
---
kernel/cgroup/cgroup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1be0f81fe8e1..243239553ea3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
return -ENODEV;
if (!cgroup_check_hierarchy_limits(parent)) {
- ret = -EAGAIN;
+ ret = -ENOSPC;
goto out_unlock;
}
--
2.36.1
On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> kernel/cgroup/cgroup.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1be0f81fe8e1..243239553ea3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
> return -ENODEV;
>
> if (!cgroup_check_hierarchy_limits(parent)) {
> - ret = -EAGAIN;
> + ret = -ENOSPC;
I'd not argue whether ENOSPC is better or worse here, but I don't think we need
to change it now. It's been in this state for a long time and is a part of ABI.
EAGAIN is pretty unique as a mkdir() result, so systemd can handle it well.
Thanks!
On 6/28/22 03:44, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
>> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
>> not return -EAGAIN, but instead react similarly to reaching the global
>> limit.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>> ---
>> kernel/cgroup/cgroup.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1be0f81fe8e1..243239553ea3 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>> return -ENODEV;
>>
>> if (!cgroup_check_hierarchy_limits(parent)) {
>> - ret = -EAGAIN;
>> + ret = -ENOSPC;
>
> I'd not argue whether ENOSPC is better or worse here, but I don't think we need
> to change it now. It's been in this state for a long time and is a part of ABI.
> EAGAIN is pretty unique as a mkdir() result, so systemd can handle it well.
I would agree with you, however in my opinion EAGAIN is used to restart an
interrupted system call. Thus, I worry its return can loop the user space without
any chance of continuation.
However, maybe I'm confusing something?
Thank you,
Vasily Averin
On Tue, Jun 28, 2022 at 06:59:06AM +0300, Vasily Averin <vvs@openvz.org> wrote: > I would agree with you, however in my opinion EAGAIN is used to restart an > interrupted system call. Thus, I worry its return can loop the user space without > any chance of continuation. > > However, maybe I'm confusing something? The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better sense here. (And I suspect the dependency on this particular value won't be very wide spread.) 0.02€ Michal
On Tue, Jun 28, 2022 at 11:16:48AM +0200, Michal Koutný wrote: > The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better > sense here. (And I suspect the dependency on this particular value won't > be very wide spread.) Given how we use these system calls as triggers for random kernel operations, I don't think adhering to posix standard is necessary or possible. Using an error code which isn't listed in the man page isn't particularly high in the list of discrepancies. Again, I'm not against changing it but I'd like to see better rationales. On one side, we have "it's been this way for a long time and there's nothing particularly broken about it". I'm not sure the arguments we have for the other side is strong enough yet. Thanks. -- tejun
On 6/28/22 12:22, Tejun Heo wrote: > On Tue, Jun 28, 2022 at 11:16:48AM +0200, Michal Koutný wrote: >> The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better >> sense here. (And I suspect the dependency on this particular value won't >> be very wide spread.) > > Given how we use these system calls as triggers for random kernel > operations, I don't think adhering to posix standard is necessary or > possible. Using an error code which isn't listed in the man page isn't > particularly high in the list of discrepancies. > > Again, I'm not against changing it but I'd like to see better > rationales. On one side, we have "it's been this way for a long time > and there's nothing particularly broken about it". I'm not sure the > arguments we have for the other side is strong enough yet. I would like to recall this patch. I experimented on fedora36 node with LXC and centos stream 9 container. and I did not noticed any critical systemd troubles with original -EAGAIN. When cgroup's limit is reached systemd cannot start new services, for example lxc-attach generates following output: [root@fc34-vvs ~]# lxc-attach c9s lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc" lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11 lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container I did not found any loop in userspace caused by EAGAIN. Messages looks unclear, however situation with the patched kernel is not much better: [root@fc34-vvs ~]# lxc-attach c9s lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc" lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11 lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container Thank you, Vasily Averin
On Wed, Jun 29, 2022 at 09:13:02AM +0300, Vasily Averin wrote: > I experimented on fedora36 node with LXC and centos stream 9 container. > and I did not noticed any critical systemd troubles with original -EAGAIN. > When cgroup's limit is reached systemd cannot start new services, > for example lxc-attach generates following output: > > [root@fc34-vvs ~]# lxc-attach c9s > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc" > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11 > lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container > > I did not found any loop in userspace caused by EAGAIN. > Messages looks unclear, however situation with the patched kernel is not much better: > > [root@fc34-vvs ~]# lxc-attach c9s > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc" > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11 > lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container I'd say "resource temporarily unavailable" is better fitting than "no space left on device" and the syscall restart thing isn't handled by -EAGAIN return value. Grep restart_block for that. Thanks. -- tejun
On Thu, Jun 30, 2022 at 04:25:57AM +0900, Tejun Heo wrote: > On Wed, Jun 29, 2022 at 09:13:02AM +0300, Vasily Averin wrote: > > I experimented on fedora36 node with LXC and centos stream 9 container. > > and I did not noticed any critical systemd troubles with original -EAGAIN. > > When cgroup's limit is reached systemd cannot start new services, > > for example lxc-attach generates following output: > > > > [root@fc34-vvs ~]# lxc-attach c9s > > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc" > > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11 > > lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup > > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd > > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container > > > > I did not found any loop in userspace caused by EAGAIN. > > Messages looks unclear, however situation with the patched kernel is not much better: > > > > [root@fc34-vvs ~]# lxc-attach c9s > > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc" > > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11 > > lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup > > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd > > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container > > I'd say "resource temporarily unavailable" is better fitting than "no > space left on device" +1 Thanks!
On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote: > When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should > not return -EAGAIN, but instead react similarly to reaching the global > limit. While I'm not necessarily against this change, I find the rationale to be somewhat lacking. Can you please elaborate why -ENOSPC is the right one while -EAGAIN is incorrect? Thanks. -- tejun
On Mon, Jun 27, 2022 at 10:12 AM Vasily Averin <vvs@openvz.org> wrote: > > When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should > not return -EAGAIN, but instead react similarly to reaching the global > limit. > > Signed-off-by: Vasily Averin <vvs@openvz.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Thanks.
Currently, the host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot start a
new service and receives a unique -ENOSPC error code.
However, this can happen inside this container, persist in the log file
of the local container, and may not be noticed by the host owner if he
did not try to start any new services.
Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: Roman Gushchin pointed that idr_alloc() should return unique -ENOSPC
if no free IDs could be found, but can also return -ENOMEM.
Therefore error code check was added before message output and
patch descriprion was adopted.
---
mm/memcontrol.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..ffc6b5d6b95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
if (memcg->id.id < 0) {
error = memcg->id.id;
+ if (error == -ENOSPC)
+ pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
goto fail;
}
--
2.36.1
On Mon, Jun 27, 2022 at 10:11 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Currently, the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot start a
> new service and receives a unique -ENOSPC error code.
> However, this can happen inside this container, persist in the log file
> of the local container, and may not be noticed by the host owner if he
> did not try to start any new services.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: Roman Gushchin pointed that idr_alloc() should return unique -ENOSPC
If the caller can know -ENOSPC is returned by mkdir(), then I
think the user (perhaps systemd) is the best place to throw out the
error message instead of in the kernel log. Right?
Thanks.
> if no free IDs could be found, but can also return -ENOMEM.
> Therefore error code check was added before message output and
> patch descriprion was adopted.
> ---
> mm/memcontrol.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..ffc6b5d6b95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
> if (memcg->id.id < 0) {
> error = memcg->id.id;
> + if (error == -ENOSPC)
> + pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
> goto fail;
> }
>
> --
> 2.36.1
>
On 6/27/22 06:23, Muchun Song wrote: > If the caller can know -ENOSPC is returned by mkdir(), then I > think the user (perhaps systemd) is the best place to throw out the > error message instead of in the kernel log. Right? Such an incident may occur inside the container. OpenVZ nodes can host 300-400 containers, and the host admin cannot monitor guest logs. the dmesg message is necessary to inform the host owner that the global limit has been reached, otherwise he can continue to believe that there are no problems on the node. Thank you, Vasily Averin
On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote: > On 6/27/22 06:23, Muchun Song wrote: > > If the caller can know -ENOSPC is returned by mkdir(), then I > > think the user (perhaps systemd) is the best place to throw out the > > error message instead of in the kernel log. Right? > > Such an incident may occur inside the container. > OpenVZ nodes can host 300-400 containers, and the host admin cannot > monitor guest logs. the dmesg message is necessary to inform the host > owner that the global limit has been reached, otherwise he can > continue to believe that there are no problems on the node. Why this is happening? It's hard to believe someone really needs that many cgroups. Is this when somebody fails to delete old cgroups? I wanted to say that it's better to introduce a memcg event, but then I realized it's probably not worth the wasted space. Is this a common scenario? I think a better approach will be to add a cgroup event (displayed via cgroup.events) about reaching the maximum limit of cgroups. E.g. cgroups.events::max_nr_reached. Then you can set cgroup.max.descendants to some value below memcg_id space size. It's more work, but IMO it's a better way to communicate this event. As a bonus, you can easily get an idea which cgroup depletes the limit. Thanks!
On Mon, Jun 27, 2022 at 06:11:27PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote: > I think a better approach will be to add a cgroup event (displayed via > cgroup.events) about reaching the maximum limit of cgroups. E.g. > cgroups.events::max_nr_reached. This sounds like a good generalization. > Then you can set cgroup.max.descendants to some value below memcg_id > space size. It's more work, but IMO it's a better way to communicate > this event. As a bonus, you can easily get an idea which cgroup > depletes the limit. Just mind there's a difference between events: what cgroup's limit was hit and what cgroup was affected by the limit [1] (the former is more useful for the calibration if I understand the situation). Michal [1] https://lore.kernel.org/all/20200205134426.10570-2-mkoutny@suse.com/
On 6/28/22 04:11, Roman Gushchin wrote: > On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote: >> On 6/27/22 06:23, Muchun Song wrote: >>> If the caller can know -ENOSPC is returned by mkdir(), then I >>> think the user (perhaps systemd) is the best place to throw out the >>> error message instead of in the kernel log. Right? >> >> Such an incident may occur inside the container. >> OpenVZ nodes can host 300-400 containers, and the host admin cannot >> monitor guest logs. the dmesg message is necessary to inform the host >> owner that the global limit has been reached, otherwise he can >> continue to believe that there are no problems on the node. > > Why this is happening? It's hard to believe someone really needs that > many cgroups. Is this when somebody fails to delete old cgroups? I do not have direct claims that some node really reached this limit, however I saw crashdumps with 30000+ cgroups. Theoretically OpenVz/LXC nodes can host up to several thousand containers per node. Practically production nodes with 300-400 containers are a common thing. I assume that each container can easily use up to 100-200 memory cgroups, and I think this is normal consumption. Therefore, I believe that 64K limit is quite achievable in real life. Primary goal of my patch is to confirm this theory. > I wanted to say that it's better to introduce a memcg event, but then > I realized it's probably not worth the wasted space. Is this a common > scenario? > > I think a better approach will be to add a cgroup event (displayed via > cgroup.events) about reaching the maximum limit of cgroups. E.g. > cgroups.events::max_nr_reached. Then you can set cgroup.max.descendants > to some value below memcg_id space size. It's more work, but IMO it's > a better way to communicate this event. As a bonus, you can easily > get an idea which cgroup depletes the limit. For my goal (i.e. just to confirm that 64K limit was reached) this functionality is too complicated. This confirmation is important because it should push us to increase the global limit. However, I think your idea is great, In perspective it helps both OpenVZ and LXC and possibly Shakeel to understand the real memcg using and set the proper limit for containers. I'm going to prepare such patches, however I'm not sure I'll have enough time for this task in the near future. Thank you, Vasily Averin
© 2016 - 2026 Red Hat, Inc.