The delegatee shouldn't be allowed to write to the resource control
interface files. The kernel rejects writes to all files other than
"cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" on a
namespace root from inside the namespace. However, delegatee can write
"cgroup.subtree_control" outsize of the namespace, this can be reproduced
by as follows:
cd /sys/fs/cgroup
echo '+pids' > cgroup.subtree_control
mkdir dlgt_grp_ns
echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
mkdir dlgt_grp_ns/dlgt_grp_ns1
echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
unshare -Cm /bin/bash
echo max > dlgt_grp_ns/dlgt_grp_ns1/pids.max // Permission denied
echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now
We set pids.max to 200 in the cgroup dlgt_grp_ns1, and we created a new
cgroup namespace. The delegatee can't write to
dlgt_grp_ns/dlgt_grp_ns1/pids.max. However, delegatee can write to
dlgt_grp_ns/cgroup.subtree_control, which is outside of the cgroup
namespace, and this invalided the pids limitation.
Cgroup namespaces, as delegation boundaries, should disallow the delegatee
to write all interfaces outside of the cgroup namespace.
Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/cgroup/cgroup.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8dbe00000fd4..1ef9413c02e3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
* cgroup.procs, cgroup.threads and cgroup.subtree_control.
*/
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
- !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
- ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
+ ctx->ns != &init_cgroup_ns &&
+ (!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
+ (!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
+ ctx->ns->root_cset->dfl_cgrp == cgrp)))
return -EPERM;
if (cft->write)
--
2.34.1
On 2024/8/12 15:37, Chen Ridong wrote:
> The delegatee shouldn't be allowed to write to the resource control
> interface files. The kernel rejects writes to all files other than
> "cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" on a
> namespace root from inside the namespace. However, delegatee can write
> "cgroup.subtree_control" outsize of the namespace, this can be reproduced
> by as follows:
>
> cd /sys/fs/cgroup
> echo '+pids' > cgroup.subtree_control
> mkdir dlgt_grp_ns
> echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
> mkdir dlgt_grp_ns/dlgt_grp_ns1
> echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
> echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
> unshare -Cm /bin/bash
> echo max > dlgt_grp_ns/dlgt_grp_ns1/pids.max // Permission denied
> echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now
>
> We set pids.max to 200 in the cgroup dlgt_grp_ns1, and we created a new
> cgroup namespace. The delegatee can't write to
> dlgt_grp_ns/dlgt_grp_ns1/pids.max. However, delegatee can write to
> dlgt_grp_ns/cgroup.subtree_control, which is outside of the cgroup
> namespace, and this invalided the pids limitation.
>
> Cgroup namespaces, as delegation boundaries, should disallow the delegatee
> to write all interfaces outside of the cgroup namespace.
>
> Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
> kernel/cgroup/cgroup.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 8dbe00000fd4..1ef9413c02e3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
> * cgroup.procs, cgroup.threads and cgroup.subtree_control.
> */
> if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
> - !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
> - ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
> + ctx->ns != &init_cgroup_ns &&
> + (!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
> + (!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
> + ctx->ns->root_cset->dfl_cgrp == cgrp)))
Can you please match the indentation?
Thanks.
> return -EPERM;
>
> if (cft->write)
Hello. On Mon, Aug 12, 2024 at 07:37:46AM GMT, Chen Ridong <chenridong@huawei.com> wrote: > cd /sys/fs/cgroup > echo '+pids' > cgroup.subtree_control > mkdir dlgt_grp_ns > echo '+pids' > dlgt_grp_ns/cgroup.subtree_control > mkdir dlgt_grp_ns/dlgt_grp_ns1 > echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs > echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max > unshare -Cm /bin/bash > echo max > /dlgt_grp_ns1/pids.max // Permission denied > echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now You could also have increased the ancestral limit (if there was any) echo max > dlgt_grp_ns/pids.max // similarly allowed If you're a root (or otherwise have sufficient permissions) and you can _see_ an ancestral cgroup, you can write to its attributes according to permissions. Thus the delegation works via cgroup ns (in)visibility but cgroup ns root is visible on both sides of the boundary hence the extra check. I can imagine that a container runtime process could enter the target cgroup ns while keeping visibility to the original cgroup ns and do some tuning on it before it drops any pointers to the original cgroup ns and exec's delegatee's workload. (But it's only my imagination to illustrate that this may be a breaking change.) OTOH, I can see why this would be consistent with the migration rules that exist between sides of cgroup ns, so this could work if it was hidden behind (another) mount option like 'nsdelegate2' :-p > @@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, > * cgroup.procs, cgroup.threads and cgroup.subtree_control. > */ > if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) && > - !(cft->flags & CFTYPE_NS_DELEGATABLE) && > - ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp) > + ctx->ns != &init_cgroup_ns && > + (!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) || > + (!(cft->flags & CFTYPE_NS_DELEGATABLE) && > + ctx->ns->root_cset->dfl_cgrp == cgrp))) > return -EPERM; Could you please also update the comment above, to describe the boundary vs subtree delegation? Thanks, Michal
Hello, On Mon, Aug 12, 2024 at 06:57:06PM +0200, Michal Koutný wrote: ... > You could also have increased the ancestral limit (if there was any) > echo max > dlgt_grp_ns/pids.max // similarly allowed > > If you're a root (or otherwise have sufficient permissions) and you can > _see_ an ancestral cgroup, you can write to its attributes according to > permissions. Thus the delegation works via cgroup ns (in)visibility but > cgroup ns root is visible on both sides of the boundary hence the extra > check. Yeah, the intended usage scenario w/ NS delegation is that the delegatee won't be able to see the ancetral cgroups beyond the delegation point. Chen, is this from an actual usecase? If so, can you describe what's going on? Thanks. -- tejun
On 2024/8/14 3:02, Tejun Heo wrote: > Hello, > > On Mon, Aug 12, 2024 at 06:57:06PM +0200, Michal Koutný wrote: > ... >> You could also have increased the ancestral limit (if there was any) >> echo max > dlgt_grp_ns/pids.max // similarly allowed >> >> If you're a root (or otherwise have sufficient permissions) and you can >> _see_ an ancestral cgroup, you can write to its attributes according to >> permissions. Thus the delegation works via cgroup ns (in)visibility but >> cgroup ns root is visible on both sides of the boundary hence the extra >> check. > > Yeah, the intended usage scenario w/ NS delegation is that the delegatee > won't be able to see the ancetral cgroups beyond the delegation point. Chen, > is this from an actual usecase? If so, can you describe what's going on? > > Thanks. > Hi,TJ, We plan to use delegation in cgroup-v2, so I am conducting some tests. As doc mentions 'Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them.' However I found a root can write parent's file(cgroup.subtree_control) to change the resource limits(a fraudulent method). I believe this could pose a risk in some scenarios where a root enters a new cgroup ns without unmounting original cgroup system, and it can break limitations. For instance, running a docker with --privileged, could this be a risk? So I sent this patch to discuss whether this case should be addressed? Thanks, Ridong
Hello, On Wed, Aug 14, 2024 at 04:09:59PM +0800, chenridong wrote: ... > Hi,TJ, We plan to use delegation in cgroup-v2, so I am conducting some > tests. > As doc mentions 'Because the resource control interface files in a given > directory control the distribution of the parent's resources, the delegatee > shouldn't be allowed to write to them.' However I found a root can write > parent's file(cgroup.subtree_control) to change the resource limits(a > fraudulent method). I believe this could pose a risk in some scenarios where > a root enters a new cgroup ns without unmounting original cgroup system, and > it can break limitations. For instance, running a docker with --privileged, > could this be a risk? > > So I sent this patch to discuss whether this case should be addressed? That sounsd like a misconfiguration. cgroup NS doesn't make much sense if you don't limit the actual visibility. The interface is half broken in that situation anyway and if you're leaking filesystem visibility into a supposedly isolated container, relaxed resource limits aren't biggest of your problems. While the proposed change isn't necessarily a bad idea, it's a behavior change and I don't either modifying existing behavior or introducing a new mount flag is justified here. Maybe just update the documentation indicating that the ancestral cgroups shouldn't be visible in a delegated ns? Thanks. -- tejun
On 2024/8/15 0:52, Tejun Heo wrote: > Hello, > > On Wed, Aug 14, 2024 at 04:09:59PM +0800, chenridong wrote: > ... >> Hi,TJ, We plan to use delegation in cgroup-v2, so I am conducting some >> tests. >> As doc mentions 'Because the resource control interface files in a given >> directory control the distribution of the parent's resources, the delegatee >> shouldn't be allowed to write to them.' However I found a root can write >> parent's file(cgroup.subtree_control) to change the resource limits(a >> fraudulent method). I believe this could pose a risk in some scenarios where >> a root enters a new cgroup ns without unmounting original cgroup system, and >> it can break limitations. For instance, running a docker with --privileged, >> could this be a risk? >> >> So I sent this patch to discuss whether this case should be addressed? > > That sounsd like a misconfiguration. cgroup NS doesn't make much sense if > you don't limit the actual visibility. The interface is half broken in that > situation anyway and if you're leaking filesystem visibility into a > supposedly isolated container, relaxed resource limits aren't biggest of > your problems. > > While the proposed change isn't necessarily a bad idea, it's a behavior > change and I don't either modifying existing behavior or introducing a new > mount flag is justified here. Maybe just update the documentation indicating > that the ancestral cgroups shouldn't be visible in a delegated ns? > > Thanks. > Thank you, TJ, I will send a patch to update comment and the documentation. Thanks, Ridong
On 2024/8/13 0:57, Michal Koutný wrote: > Hello. > > On Mon, Aug 12, 2024 at 07:37:46AM GMT, Chen Ridong <chenridong@huawei.com> wrote: >> cd /sys/fs/cgroup >> echo '+pids' > cgroup.subtree_control >> mkdir dlgt_grp_ns >> echo '+pids' > dlgt_grp_ns/cgroup.subtree_control >> mkdir dlgt_grp_ns/dlgt_grp_ns1 >> echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs >> echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max >> unshare -Cm /bin/bash >> echo max > /dlgt_grp_ns1/pids.max // Permission denied >> echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now > > You could also have increased the ancestral limit (if there was any) > echo max > dlgt_grp_ns/pids.max // similarly allowed > > If you're a root (or otherwise have sufficient permissions) and you can > _see_ an ancestral cgroup, you can write to its attributes according to > permissions. Thus the delegation works via cgroup ns (in)visibility but > cgroup ns root is visible on both sides of the boundary hence the extra > check. > > I can imagine that a container runtime process could enter the target > cgroup ns while keeping visibility to the original cgroup ns and do some > tuning on it before it drops any pointers to the original cgroup ns and > exec's delegatee's workload. (But it's only my imagination to illustrate > that this may be a breaking change.) > Yes, this is a change. However, the process within the container can turn on subsystems , they can also turn off subsystems , which may not only affect the host machine but may also affect other containers. I believe this poses a danger. > OTOH, I can see why this would be consistent with the migration rules > that exist between sides of cgroup ns, so this could work if it was > hidden behind (another) mount option like 'nsdelegate2' :-p > Hello, Michal, I am sorry I can not get your point. Did you mean this issue should be fixed with another mount option 'nsdelegate2'? Or is the current patch acceptable? > >> @@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, >> * cgroup.procs, cgroup.threads and cgroup.subtree_control. >> */ >> if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) && >> - !(cft->flags & CFTYPE_NS_DELEGATABLE) && >> - ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp) >> + ctx->ns != &init_cgroup_ns && >> + (!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) || >> + (!(cft->flags & CFTYPE_NS_DELEGATABLE) && >> + ctx->ns->root_cset->dfl_cgrp == cgrp))) >> return -EPERM; > > Could you please also update the comment above, to describe the boundary > vs subtree delegation? > Thanks, > Michal Sure. Thanks Ridong
© 2016 - 2026 Red Hat, Inc.