Fixes about delegation.

[PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by Chen Ridong 1 year, 6 months ago

The delegatee shouldn't be allowed to write to the resource control
interface files. The kernel rejects writes to all files other than
"cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" on a
namespace root from inside the namespace. However, delegatee can write
"cgroup.subtree_control" outsize of the namespace, this can be reproduced
by as follows:

cd /sys/fs/cgroup
echo '+pids' > cgroup.subtree_control
mkdir dlgt_grp_ns
echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
mkdir dlgt_grp_ns/dlgt_grp_ns1
echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
unshare -Cm /bin/bash
echo max > dlgt_grp_ns/dlgt_grp_ns1/pids.max // Permission denied
echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now

We set pids.max to 200 in the cgroup dlgt_grp_ns1, and we created a new
cgroup namespace. The delegatee can't write to
dlgt_grp_ns/dlgt_grp_ns1/pids.max. However, delegatee can write to
dlgt_grp_ns/cgroup.subtree_control, which is outside of the cgroup
namespace, and this invalided the pids limitation.

Cgroup namespaces, as delegation boundaries, should disallow the delegatee
to write all interfaces outside of the cgroup namespace.

Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/cgroup/cgroup.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8dbe00000fd4..1ef9413c02e3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
 	 * cgroup.procs, cgroup.threads and cgroup.subtree_control.
 	 */
 	if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
-	    !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
-	    ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
+		ctx->ns != &init_cgroup_ns &&
+		(!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
+			(!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
+			ctx->ns->root_cset->dfl_cgrp == cgrp)))
 		return -EPERM;
 
 	if (cft->write)
-- 
2.34.1

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by Xiu Jianfeng 1 year, 6 months ago


On 2024/8/12 15:37, Chen Ridong wrote:
> The delegatee shouldn't be allowed to write to the resource control
> interface files. The kernel rejects writes to all files other than
> "cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" on a
> namespace root from inside the namespace. However, delegatee can write
> "cgroup.subtree_control" outsize of the namespace, this can be reproduced
> by as follows:
> 
> cd /sys/fs/cgroup
> echo '+pids' > cgroup.subtree_control
> mkdir dlgt_grp_ns
> echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
> mkdir dlgt_grp_ns/dlgt_grp_ns1
> echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
> echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
> unshare -Cm /bin/bash
> echo max > dlgt_grp_ns/dlgt_grp_ns1/pids.max // Permission denied
> echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now
> 
> We set pids.max to 200 in the cgroup dlgt_grp_ns1, and we created a new
> cgroup namespace. The delegatee can't write to
> dlgt_grp_ns/dlgt_grp_ns1/pids.max. However, delegatee can write to
> dlgt_grp_ns/cgroup.subtree_control, which is outside of the cgroup
> namespace, and this invalided the pids limitation.
> 
> Cgroup namespaces, as delegation boundaries, should disallow the delegatee
> to write all interfaces outside of the cgroup namespace.
> 
> Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  kernel/cgroup/cgroup.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 8dbe00000fd4..1ef9413c02e3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  	 * cgroup.procs, cgroup.threads and cgroup.subtree_control.
>  	 */
>  	if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
> -	    !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
> -	    ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
> +		ctx->ns != &init_cgroup_ns &&
> +		(!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
> +			(!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
> +			ctx->ns->root_cset->dfl_cgrp == cgrp)))

Can you please match the indentation?
Thanks.

>  		return -EPERM;
>  
>  	if (cft->write)

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by Michal Koutný 1 year, 6 months ago

Hello.

On Mon, Aug 12, 2024 at 07:37:46AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> cd /sys/fs/cgroup
> echo '+pids' > cgroup.subtree_control
> mkdir dlgt_grp_ns
> echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
> mkdir dlgt_grp_ns/dlgt_grp_ns1
> echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
> echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
> unshare -Cm /bin/bash
> echo max > /dlgt_grp_ns1/pids.max // Permission denied
> echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now

You could also have increased the ancestral limit (if there was any)
echo max > dlgt_grp_ns/pids.max // similarly allowed

If you're a root (or otherwise have sufficient permissions) and you can
_see_ an ancestral cgroup, you can write to its attributes according to
permissions. Thus the delegation works via cgroup ns (in)visibility but
cgroup ns root is visible on both sides of the boundary hence the extra
check.

I can imagine that a container runtime process could enter the target
cgroup ns while keeping visibility to the original cgroup ns and do some
tuning on it before it drops any pointers to the original cgroup ns and
exec's delegatee's workload. (But it's only my imagination to illustrate
that this may be a breaking change.)

OTOH, I can see why this would be consistent with the migration rules
that exist between sides of cgroup ns, so this could work if it was
hidden behind (another) mount option like 'nsdelegate2' :-p

> @@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  	 * cgroup.procs, cgroup.threads and cgroup.subtree_control.
>  	 */
>  	if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
> -	    !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
> -	    ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
> +		ctx->ns != &init_cgroup_ns &&
> +		(!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
> +			(!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
> +			ctx->ns->root_cset->dfl_cgrp == cgrp)))
>  		return -EPERM;

Could you please also update the comment above, to describe the boundary
vs subtree delegation?

Thanks,
Michal

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by Tejun Heo 1 year, 5 months ago

Hello,

On Mon, Aug 12, 2024 at 06:57:06PM +0200, Michal Koutný wrote:
...
> You could also have increased the ancestral limit (if there was any)
> echo max > dlgt_grp_ns/pids.max // similarly allowed
> 
> If you're a root (or otherwise have sufficient permissions) and you can
> _see_ an ancestral cgroup, you can write to its attributes according to
> permissions. Thus the delegation works via cgroup ns (in)visibility but
> cgroup ns root is visible on both sides of the boundary hence the extra
> check.

Yeah, the intended usage scenario w/ NS delegation is that the delegatee
won't be able to see the ancetral cgroups beyond the delegation point. Chen,
is this from an actual usecase? If so, can you describe what's going on?

Thanks.

-- 
tejun

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by chenridong 1 year, 5 months ago


On 2024/8/14 3:02, Tejun Heo wrote:
> Hello,
> 
> On Mon, Aug 12, 2024 at 06:57:06PM +0200, Michal Koutný wrote:
> ...
>> You could also have increased the ancestral limit (if there was any)
>> echo max > dlgt_grp_ns/pids.max // similarly allowed
>>
>> If you're a root (or otherwise have sufficient permissions) and you can
>> _see_ an ancestral cgroup, you can write to its attributes according to
>> permissions. Thus the delegation works via cgroup ns (in)visibility but
>> cgroup ns root is visible on both sides of the boundary hence the extra
>> check.
> 
> Yeah, the intended usage scenario w/ NS delegation is that the delegatee
> won't be able to see the ancetral cgroups beyond the delegation point. Chen,
> is this from an actual usecase? If so, can you describe what's going on?
> 
> Thanks.
> 
Hi，TJ， We plan to use delegation in cgroup-v2, so I am conducting some 
tests.
As doc mentions 'Because the resource control interface files in a given 
directory control the distribution of the parent's resources, the 
delegatee shouldn't be allowed to write to them.' However I found a root 
can write parent's file(cgroup.subtree_control) to change the resource 
limits(a fraudulent method). I believe this could pose a risk in some 
scenarios where a root enters a new cgroup ns without unmounting 
original cgroup system, and it can break limitations. For instance, 
running a docker with --privileged, could this be a risk?

So I sent this patch to discuss whether this case should be addressed?

Thanks,
Ridong

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by Tejun Heo 1 year, 5 months ago

Hello,

On Wed, Aug 14, 2024 at 04:09:59PM +0800, chenridong wrote:
...
> Hi，TJ， We plan to use delegation in cgroup-v2, so I am conducting some
> tests.
> As doc mentions 'Because the resource control interface files in a given
> directory control the distribution of the parent's resources, the delegatee
> shouldn't be allowed to write to them.' However I found a root can write
> parent's file(cgroup.subtree_control) to change the resource limits(a
> fraudulent method). I believe this could pose a risk in some scenarios where
> a root enters a new cgroup ns without unmounting original cgroup system, and
> it can break limitations. For instance, running a docker with --privileged,
> could this be a risk?
> 
> So I sent this patch to discuss whether this case should be addressed?

That sounsd like a misconfiguration. cgroup NS doesn't make much sense if
you don't limit the actual visibility. The interface is half broken in that
situation anyway and if you're leaking filesystem visibility into a
supposedly isolated container, relaxed resource limits aren't biggest of
your problems.

While the proposed change isn't necessarily a bad idea, it's a behavior
change and I don't either modifying existing behavior or introducing a new
mount flag is justified here. Maybe just update the documentation indicating
that the ancestral cgroups shouldn't be visible in a delegated ns?

Thanks.

-- 
tejun

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by chenridong 1 year, 5 months ago


On 2024/8/15 0:52, Tejun Heo wrote:
> Hello,
> 
> On Wed, Aug 14, 2024 at 04:09:59PM +0800, chenridong wrote:
> ...
>> Hi，TJ， We plan to use delegation in cgroup-v2, so I am conducting some
>> tests.
>> As doc mentions 'Because the resource control interface files in a given
>> directory control the distribution of the parent's resources, the delegatee
>> shouldn't be allowed to write to them.' However I found a root can write
>> parent's file(cgroup.subtree_control) to change the resource limits(a
>> fraudulent method). I believe this could pose a risk in some scenarios where
>> a root enters a new cgroup ns without unmounting original cgroup system, and
>> it can break limitations. For instance, running a docker with --privileged,
>> could this be a risk?
>>
>> So I sent this patch to discuss whether this case should be addressed?
> 
> That sounsd like a misconfiguration. cgroup NS doesn't make much sense if
> you don't limit the actual visibility. The interface is half broken in that
> situation anyway and if you're leaking filesystem visibility into a
> supposedly isolated container, relaxed resource limits aren't biggest of
> your problems.
> 
> While the proposed change isn't necessarily a bad idea, it's a behavior
> change and I don't either modifying existing behavior or introducing a new
> mount flag is justified here. Maybe just update the documentation indicating
> that the ancestral cgroups shouldn't be visible in a delegated ns?
> 
> Thanks.
> 

Thank you, TJ, I will send a patch to update comment and the documentation.

Thanks,
Ridong

Re: [PATCH -next 2/2] cgroup: Disallow delegatee to write all interfaces outsize of cgroup ns

Posted by chenridong 1 year, 6 months ago


On 2024/8/13 0:57, Michal Koutný wrote:
> Hello.
> 
> On Mon, Aug 12, 2024 at 07:37:46AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
>> cd /sys/fs/cgroup
>> echo '+pids' > cgroup.subtree_control
>> mkdir dlgt_grp_ns
>> echo '+pids' > dlgt_grp_ns/cgroup.subtree_control
>> mkdir dlgt_grp_ns/dlgt_grp_ns1
>> echo $$ > dlgt_grp_ns/dlgt_grp_ns1/cgroup.procs
>> echo 200 > dlgt_grp_ns/dlgt_grp_ns1/pids.max
>> unshare -Cm /bin/bash
>> echo max > /dlgt_grp_ns1/pids.max // Permission denied
>> echo -pids > dlgt_grp_ns/cgroup.subtree_control // pids was unlimited now
> 
> You could also have increased the ancestral limit (if there was any)
> echo max > dlgt_grp_ns/pids.max // similarly allowed
> 
> If you're a root (or otherwise have sufficient permissions) and you can
> _see_ an ancestral cgroup, you can write to its attributes according to
> permissions. Thus the delegation works via cgroup ns (in)visibility but
> cgroup ns root is visible on both sides of the boundary hence the extra
> check.
> 
> I can imagine that a container runtime process could enter the target
> cgroup ns while keeping visibility to the original cgroup ns and do some
> tuning on it before it drops any pointers to the original cgroup ns and
> exec's delegatee's workload. (But it's only my imagination to illustrate
> that this may be a breaking change.)
> 
Yes, this is a change. However, the process within the container can 
turn on subsystems , they can also turn off subsystems , which may not 
only affect the host machine but may also affect other containers. I 
believe this poses a danger.

> OTOH, I can see why this would be consistent with the migration rules
> that exist between sides of cgroup ns, so this could work if it was
> hidden behind (another) mount option like 'nsdelegate2' :-p
> 

Hello, Michal, I am sorry I can not get your point.
Did you mean this issue should be fixed with another mount option 
'nsdelegate2'? Or is the current patch acceptable?

> 
>> @@ -4134,8 +4134,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>>   	 * cgroup.procs, cgroup.threads and cgroup.subtree_control.
>>   	 */
>>   	if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
>> -	    !(cft->flags & CFTYPE_NS_DELEGATABLE) &&
>> -	    ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
>> +		ctx->ns != &init_cgroup_ns &&
>> +		(!cgroup_is_descendant(cgrp, ctx->ns->root_cset->dfl_cgrp) ||
>> +			(!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
>> +			ctx->ns->root_cset->dfl_cgrp == cgrp)))
>>   		return -EPERM;
> 
> Could you please also update the comment above, to describe the boundary
> vs subtree delegation?
> Thanks,
> Michal

Sure.

Thanks
Ridong