[RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min

Liu Song posted 1 patch 2 months, 2 weeks ago
kernel/sched/core.c | 21 +++++----------------
1 file changed, 5 insertions(+), 16 deletions(-)
[RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by Liu Song 2 months, 2 weeks ago
For the handling logic of child_quota, there is no need to distinguish
between cgroup1 and cgroup2, so unify the handling logic here.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
---
 kernel/sched/core.c | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e752146e59a4..8418c67faa69 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9501,23 +9501,12 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 		parent_quota = parent_b->hierarchical_quota;
 
 		/*
-		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
-		 * always take the non-RUNTIME_INF min.  On cgroup1, only
-		 * inherit when no limit is set. In both cases this is used
-		 * by the scheduler to determine if a given CFS task has a
-		 * bandwidth constraint at some higher level.
+		 * Ensure max(child_quota) <= parent_quota.
 		 */
-		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
-			if (quota == RUNTIME_INF)
-				quota = parent_quota;
-			else if (parent_quota != RUNTIME_INF)
-				quota = min(quota, parent_quota);
-		} else {
-			if (quota == RUNTIME_INF)
-				quota = parent_quota;
-			else if (parent_quota != RUNTIME_INF && quota > parent_quota)
-				return -EINVAL;
-		}
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF)
+			quota = min(quota, parent_quota);
 	}
 	cfs_b->hierarchical_quota = quota;
 
-- 
2.19.1.6.gb485710b
Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by Tejun Heo 2 months, 2 weeks ago
On Tue, Sep 10, 2024 at 03:48:32PM +0800, Liu Song wrote:
> For the handling logic of child_quota, there is no need to distinguish
> between cgroup1 and cgroup2, so unify the handling logic here.
> 
> Signed-off-by: Liu Song <liusong@linux.alibaba.com>

It doens't make much sense to change the interface for cgroup1 at this
point. Let's please leave it as-is.

Thanks.

-- 
tejun
Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by 刘嵩 2 months, 2 weeks ago

> 2024年9月11日 03:32,Tejun Heo <tj@kernel.org> 写道:
> 
> On Tue, Sep 10, 2024 at 03:48:32PM +0800, Liu Song wrote:
>> For the handling logic of child_quota, there is no need to distinguish
>> between cgroup1 and cgroup2, so unify the handling logic here.
>> 
>> Signed-off-by: Liu Song <liusong@linux.alibaba.com>
> 
> It doens't make much sense to change the interface for cgroup1 at this
> point. Let's please leave it as-is.
> 
> Thanks.
> 
> -- 
> tejun

Hi

In scenarios involving secure shared containers (like Kata), where containers are deployed
on VMs and constrained by CPU runtime using quotas, the concept of vCPUs comes into
play.

If the CPU limit set by Kubernetes is less than the actual number of vCPUs, meaning the
CPU count derived from the quota is less than the vCPU count, then when a user runs lscpu
inside the container, the reported CPU count will be greater than the container's quota.

If the user uses this reported count to calculate quota and attempts to set it for their own
sub-container, it will result in an error under cgroup1, whereas the same operation will
succeed under cgroup2. To avoid imposing extra learning costs on users, unifying the
handling logic in this regard is still beneficial.

Thanks
Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by Phil Auld 2 months, 2 weeks ago
Hi,

On Tue, Sep 10, 2024 at 03:48:32PM +0800 Liu Song wrote:
> For the handling logic of child_quota, there is no need to distinguish
> between cgroup1 and cgroup2, so unify the handling logic here.
> 
> Signed-off-by: Liu Song <liusong@linux.alibaba.com>
> ---
>  kernel/sched/core.c | 21 +++++----------------
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e752146e59a4..8418c67faa69 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9501,23 +9501,12 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>  		parent_quota = parent_b->hierarchical_quota;
>  
>  		/*
> -		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
> -		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> -		 * inherit when no limit is set. In both cases this is used
> -		 * by the scheduler to determine if a given CFS task has a
> -		 * bandwidth constraint at some higher level.

This comment is here for a reason. Please don't remove it. 

> +		 * Ensure max(child_quota) <= parent_quota.
>  		 */
> -		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> -			if (quota == RUNTIME_INF)
> -				quota = parent_quota;
> -			else if (parent_quota != RUNTIME_INF)
> -				quota = min(quota, parent_quota);
> -		} else {
> -			if (quota == RUNTIME_INF)
> -				quota = parent_quota;
> -			else if (parent_quota != RUNTIME_INF && quota > parent_quota)
> -				return -EINVAL;
> -		}
> +		if (quota == RUNTIME_INF)
> +			quota = parent_quota;
> +		else if (parent_quota != RUNTIME_INF)
> +			quota = min(quota, parent_quota);
>  	}
>  	cfs_b->hierarchical_quota = quota;
>

I don't think there is a need to optimize this slow path
to allow setting invalid values which have to be handled in
fast paths.   And this will change expected behavior.

So NAK.

Cheers,
Phil

--
Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by 刘嵩 2 months, 2 weeks ago

> 2024年9月10日 18:49,Phil Auld <pauld@redhat.com> 写道:
> 
> 
> Hi,
> 
> On Tue, Sep 10, 2024 at 03:48:32PM +0800 Liu Song wrote:
>> For the handling logic of child_quota, there is no need to distinguish
>> between cgroup1 and cgroup2, so unify the handling logic here.
>> 
>> Signed-off-by: Liu Song <liusong@linux.alibaba.com>
>> ---
>> kernel/sched/core.c | 21 +++++----------------
>> 1 file changed, 5 insertions(+), 16 deletions(-)
>> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index e752146e59a4..8418c67faa69 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9501,23 +9501,12 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>> parent_quota = parent_b->hierarchical_quota;
>> 
>> /*
>> - * Ensure max(child_quota) <= parent_quota.  On cgroup2,
>> - * always take the non-RUNTIME_INF min.  On cgroup1, only
>> - * inherit when no limit is set. In both cases this is used
>> - * by the scheduler to determine if a given CFS task has a
>> - * bandwidth constraint at some higher level.
> 
> This comment is here for a reason. Please don't remove it.

Hi

I don’t see why cgroup1 needs to impose this restriction while cgroup2
can directly take the non-RUNTIME_INF minimum value. What is the
necessity of this? 

It seems more reasonable to unify the handling logic. Even if the child
group quota exceeds the parent group quota, it would not actually take
effect. 

However, if the parent group quota is reset to a larger value, then the
child group quota would have actual significance. Therefore, the handling
logic should be consistent between cgroup1 and cgroup2.

Thanks


> 
>> + * Ensure max(child_quota) <= parent_quota.
>> */
>> - if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
>> - if (quota == RUNTIME_INF)
>> - quota = parent_quota;
>> - else if (parent_quota != RUNTIME_INF)
>> - quota = min(quota, parent_quota);
>> - } else {
>> - if (quota == RUNTIME_INF)
>> - quota = parent_quota;
>> - else if (parent_quota != RUNTIME_INF && quota > parent_quota)
>> - return -EINVAL;
>> - }
>> + if (quota == RUNTIME_INF)
>> + quota = parent_quota;
>> + else if (parent_quota != RUNTIME_INF)
>> + quota = min(quota, parent_quota);
>> }
>> cfs_b->hierarchical_quota = quota;
>> 
> 
> I don't think there is a need to optimize this slow path
> to allow setting invalid values which have to be handled in
> fast paths.   And this will change expected behavior.
> 
> So NAK.
> 
> Cheers,
> Phil
> 
> --
Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by Phil Auld 2 months, 2 weeks ago
On Tue, Sep 10, 2024 at 07:13:32PM +0800 刘嵩 wrote:
> 
> 
> > 2024年9月10日 18:49,Phil Auld <pauld@redhat.com> 写道:
> > 
> > 
> > Hi,
> > 
> > On Tue, Sep 10, 2024 at 03:48:32PM +0800 Liu Song wrote:
> >> For the handling logic of child_quota, there is no need to distinguish
> >> between cgroup1 and cgroup2, so unify the handling logic here.
> >> 
> >> Signed-off-by: Liu Song <liusong@linux.alibaba.com>
> >> ---
> >> kernel/sched/core.c | 21 +++++----------------
> >> 1 file changed, 5 insertions(+), 16 deletions(-)
> >> 
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index e752146e59a4..8418c67faa69 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -9501,23 +9501,12 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >> parent_quota = parent_b->hierarchical_quota;
> >> 
> >> /*
> >> - * Ensure max(child_quota) <= parent_quota.  On cgroup2,
> >> - * always take the non-RUNTIME_INF min.  On cgroup1, only
> >> - * inherit when no limit is set. In both cases this is used
> >> - * by the scheduler to determine if a given CFS task has a
> >> - * bandwidth constraint at some higher level.
> > 
> > This comment is here for a reason. Please don't remove it.
> 
> Hi
> 
> I don’t see why cgroup1 needs to impose this restriction while cgroup2
> can directly take the non-RUNTIME_INF minimum value. What is the
> necessity of this? 
>

That's how cgroupv1 bandwidth control is defined. See
Documentation/scheduler/sched-bcw.rst.

> It seems more reasonable to unify the handling logic. Even if the child
> group quota exceeds the parent group quota, it would not actually take
> effect. 
>

It's not about it taking effect or not. You are not supposed to be
allowed to configure a child quota > parent quota. It's supposed to
be an error. 

Also, my comment about the comment specifically is that last sentence, which
explains that other parts of the scheduler rely on this being set correctly,
needs to remain.  But since I don't think this change is right, that should
not be an issue.


Cheers,
Phil

> However, if the parent group quota is reset to a larger value, then the
> child group quota would have actual significance. Therefore, the handling
> logic should be consistent between cgroup1 and cgroup2.
> 
> Thanks
> 
> 
> > 
> >> + * Ensure max(child_quota) <= parent_quota.
> >> */
> >> - if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> >> - if (quota == RUNTIME_INF)
> >> - quota = parent_quota;
> >> - else if (parent_quota != RUNTIME_INF)
> >> - quota = min(quota, parent_quota);
> >> - } else {
> >> - if (quota == RUNTIME_INF)
> >> - quota = parent_quota;
> >> - else if (parent_quota != RUNTIME_INF && quota > parent_quota)
> >> - return -EINVAL;
> >> - }
> >> + if (quota == RUNTIME_INF)
> >> + quota = parent_quota;
> >> + else if (parent_quota != RUNTIME_INF)
> >> + quota = min(quota, parent_quota);
> >> }
> >> cfs_b->hierarchical_quota = quota;
> >> 
> > 
> > I don't think there is a need to optimize this slow path
> > to allow setting invalid values which have to be handled in
> > fast paths.   And this will change expected behavior.
> > 
> > So NAK.
> > 
> > Cheers,
> > Phil
> > 
> > --
> 

-- 

Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by Michal Koutný 2 months, 2 weeks ago
Hello.

On Tue, Sep 10, 2024 at 03:48:32PM GMT, Liu Song <liusong@linux.alibaba.com> wrote:
> For the handling logic of child_quota, there is no need to distinguish
> between cgroup1 and cgroup2, so unify the handling logic here.

IIUC, v1 check prevents quota overcommit while v2 doesn't.  So this
isn't user invisible change (i.e. there may be a need to distinguish the
two).

Regards,
Michal
Re: [RFC PATCH] sched, cgroup: cgroup1 can also take the non-RUNTIME_INF min
Posted by 刘嵩 2 months, 2 weeks ago

> 2024年9月10日 17:35,Michal Koutný <mkoutny@suse.com> 写道:
> 
> Hello.
> 
> On Tue, Sep 10, 2024 at 03:48:32PM GMT, Liu Song <liusong@linux.alibaba.com> wrote:
>> For the handling logic of child_quota, there is no need to distinguish
>> between cgroup1 and cgroup2, so unify the handling logic here.
> 
> IIUC, v1 check prevents quota overcommit while v2 doesn't.  So this
> isn't user invisible change (i.e. there may be a need to distinguish the
> two).
> 
Hi

In cgroup2, child task group quota can set exceed parent task group quota.
However, at the scheduling level, the cpu runqueue (rq) performs checks
on the runtime of the CFS run queue (cfs_rq) in a hierarchical way. 
Therefore, even though the child group quota appears to be greater than the
parent group quota, it will not actually receive a runtime that exceeds the
parent's quota. This logic also holds true for cgroup1, so there is no need to
differentiate the handling logic in this case.

Thanks

> Regards,
> Michal