The current cpuset code uses the global cpuset_attach_old_cs variable
to store the old cpuset value between consecutive cpuset_can_attach()
and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
attach operations are possible.
When concurrent cpuset attach operations are in progress,
cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
causing an incorrect result. To avoid this problem while still allowing
a certain level of parallelism, drop cpuset_attach_old_cs and use a
per-cpuset attach_old_cs value. Also allow at most one active
attach operation per cpuset to avoid corrupting the per-cpuset
attach_old_cs value.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2367de611c42..3f925c261513 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -198,6 +198,8 @@ struct cpuset {
/* Handle for cpuset.cpus.partition */
struct cgroup_file partition_file;
+
+ struct cpuset *attach_old_cs;
};
/*
@@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}
-static struct cpuset *cpuset_attach_old_cs;
-
/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
- struct cpuset *cs;
+ struct cpuset *cs, *oldcs;
struct task_struct *task;
int ret;
/* used later by cpuset_attach() */
- cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
+ oldcs = task_cs(cgroup_taskset_first(tset, &css));
cs = css_cs(css);
percpu_down_write(&cpuset_rwsem);
+ /*
+ * Only one cpuset attach operation is allowed for each cpuset.
+ */
+ ret = -EBUSY;
+ if (cs->attach_in_progress)
+ goto out_unlock;
+
/* allow moving tasks into an empty cpuset if on default hierarchy */
ret = -ENOSPC;
if (!is_in_v2_mode() &&
@@ -2498,6 +2505,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* changes which zero cpus/mems_allowed.
*/
cs->attach_in_progress++;
+ cs->attach_old_cs = oldcs;
ret = 0;
out_unlock:
percpu_up_write(&cpuset_rwsem);
@@ -2548,7 +2556,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *leader;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = cpuset_attach_old_cs;
+ struct cpuset *oldcs;
bool cpus_updated, mems_updated;
cgroup_taskset_first(tset, &css);
@@ -2556,6 +2564,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
percpu_down_write(&cpuset_rwsem);
+ oldcs = cs->attach_old_cs;
cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
--
2.31.1
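For illustration, a minimal userspace sketch of the lost-update pattern
the commit message describes (hypothetical stub names, not kernel code):
one operation's prepare phase overwrites the shared global before another
operation's commit phase reads it back.

#include <pthread.h>
#include <stdio.h>

struct cpuset_stub { int id; };

static struct cpuset_stub set_a = { 1 }, set_b = { 2 };
static struct cpuset_stub *attach_old_cs;	/* shared by all operations */

static void can_attach(struct cpuset_stub *oldcs)
{
	attach_old_cs = oldcs;		/* prepare phase records the old value */
}

static struct cpuset_stub *attach(void)
{
	return attach_old_cs;		/* commit phase reads it back */
}

static void *op_b(void *arg)
{
	(void)arg;
	can_attach(&set_b);		/* a second operation's prepare phase */
	return NULL;
}

int main(void)
{
	pthread_t t;

	can_attach(&set_a);		/* operation A prepares with set 1 */
	pthread_create(&t, NULL, op_b, NULL);
	pthread_join(t, NULL);		/* B's prepare lands before A's commit */

	/* A commits and reads B's value: id 2 instead of the expected 1 */
	printf("op A sees old cpuset id %d\n", attach()->id);
	return 0;
}

Built with something like `cc -pthread race.c`, operation A prints id 2
rather than its own old cpuset, which is the corruption the per-cpuset
attach_old_cs field is meant to avoid.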
On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long wrote:
> The current cpuset code uses the global cpuset_attach_old_cs variable
> to store the old cpuset value between consecutive cpuset_can_attach()
> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
> attach operations are possible.
>
> When concurrent cpuset attach operations are in progress,
> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
> causing an incorrect result. To avoid this problem while still allowing
> a certain level of parallelism, drop cpuset_attach_old_cs and use a
> per-cpuset attach_old_cs value. Also allow at most one active
> attach operation per cpuset to avoid corrupting the per-cpuset
> attach_old_cs value.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 2367de611c42..3f925c261513 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -198,6 +198,8 @@ struct cpuset {
>
> /* Handle for cpuset.cpus.partition */
> struct cgroup_file partition_file;
> +
> + struct cpuset *attach_old_cs;
> };
>
> /*
> @@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
> return val;
> }
>
> -static struct cpuset *cpuset_attach_old_cs;
> -
> /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
> static int cpuset_can_attach(struct cgroup_taskset *tset)
> {
> struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> + struct cpuset *cs, *oldcs;
> struct task_struct *task;
> int ret;
>
> /* used later by cpuset_attach() */
> - cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> + oldcs = task_cs(cgroup_taskset_first(tset, &css));
> cs = css_cs(css);
>
> percpu_down_write(&cpuset_rwsem);
>
> + /*
> + * Only one cpuset attach operation is allowed for each cpuset.
> + */
> + ret = -EBUSY;
> + if (cs->attach_in_progress)
> + goto out_unlock;
That'll mean CLONE_INTO_CGROUP becomes even more interesting because it
isn't subject to this restriction in contrast to fork()+migrate, right?
On 4/6/23 05:44, Christian Brauner wrote:
> On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long wrote:
>> The current cpuset code uses the global cpuset_attach_old_cs variable
>> to store the old cpuset value between consecutive cpuset_can_attach()
>> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
>> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
>> attach operations are possible.
>>
>> When concurrent cpuset attach operations are in progress,
>> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
>> causing an incorrect result. To avoid this problem while still allowing
>> a certain level of parallelism, drop cpuset_attach_old_cs and use a
>> per-cpuset attach_old_cs value. Also allow at most one active
>> attach operation per cpuset to avoid corrupting the per-cpuset
>> attach_old_cs value.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
>> 1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 2367de611c42..3f925c261513 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -198,6 +198,8 @@ struct cpuset {
>>
>> /* Handle for cpuset.cpus.partition */
>> struct cgroup_file partition_file;
>> +
>> + struct cpuset *attach_old_cs;
>> };
>>
>> /*
>> @@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
>> return val;
>> }
>>
>> -static struct cpuset *cpuset_attach_old_cs;
>> -
>> /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
>> static int cpuset_can_attach(struct cgroup_taskset *tset)
>> {
>> struct cgroup_subsys_state *css;
>> - struct cpuset *cs;
>> + struct cpuset *cs, *oldcs;
>> struct task_struct *task;
>> int ret;
>>
>> /* used later by cpuset_attach() */
>> - cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>> + oldcs = task_cs(cgroup_taskset_first(tset, &css));
>> cs = css_cs(css);
>>
>> percpu_down_write(&cpuset_rwsem);
>>
>> + /*
>> + * Only one cpuset attach operation is allowed for each cpuset.
>> + */
>> + ret = -EBUSY;
>> + if (cs->attach_in_progress)
>> + goto out_unlock;
> That'll mean CLONE_INTO_CGROUP becomes even more interesting because it
> isn't subject to this restriction in contrast to fork()+migrate, right?
This patch is not related to CLONE_INTO_CGROUP. In fact, I have
dropped that in a later version.
Cheers,
Longman
On Fri, Apr 07, 2023 at 11:30:14AM -0400, Waiman Long wrote:
> On 4/6/23 05:44, Christian Brauner wrote:
> > On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long wrote:
> > > The current cpuset code uses the global cpuset_attach_old_cs variable
> > > to store the old cpuset value between consecutive cpuset_can_attach()
> > > and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
> > > not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
> > > attach operations are possible.
> > >
> > > When concurrent cpuset attach operations are in progress,
> > > cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
> > > causing an incorrect result. To avoid this problem while still allowing
> > > a certain level of parallelism, drop cpuset_attach_old_cs and use a
> > > per-cpuset attach_old_cs value. Also allow at most one active
> > > attach operation per cpuset to avoid corrupting the per-cpuset
> > > attach_old_cs value.
> > >
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > ---
> > > kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
> > > 1 file changed, 14 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > index 2367de611c42..3f925c261513 100644
> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -198,6 +198,8 @@ struct cpuset {
> > > /* Handle for cpuset.cpus.partition */
> > > struct cgroup_file partition_file;
> > > +
> > > + struct cpuset *attach_old_cs;
> > > };
> > > /*
> > > @@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
> > > return val;
> > > }
> > > -static struct cpuset *cpuset_attach_old_cs;
> > > -
> > > /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
> > > static int cpuset_can_attach(struct cgroup_taskset *tset)
> > > {
> > > struct cgroup_subsys_state *css;
> > > - struct cpuset *cs;
> > > + struct cpuset *cs, *oldcs;
> > > struct task_struct *task;
> > > int ret;
> > > /* used later by cpuset_attach() */
> > > - cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> > > + oldcs = task_cs(cgroup_taskset_first(tset, &css));
> > > cs = css_cs(css);
> > > percpu_down_write(&cpuset_rwsem);
> > > + /*
> > > + * Only one cpuset attach operation is allowed for each cpuset.
> > > + */
> > > + ret = -EBUSY;
> > > + if (cs->attach_in_progress)
> > > + goto out_unlock;
> > That'll mean CLONE_INTO_CGROUP becomes even more interesting because it
> > isn't subject to this restriction in contrast to fork()+migrate, right?
>
> This patch is not related to CLONE_INTO_CGROUP. In fact, I have dropped
I know that. My point had been that this patch introduced a new
restriction affecting parallelism in the migration path while the
CLONE_INTO_CGROUP path remained unaffected.
On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long <longman@redhat.com> wrote:
> The current cpuset code uses the global cpuset_attach_old_cs variable
> to store the old cpuset value between consecutive cpuset_can_attach()
> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
> attach operations are possible.

Do I understand correctly that this is a consequence of
cpuset_attach_task() on the clone path?
In that particular case (with CLONE_INTO_CGROUP) cgroup_mutex is taken,
so the access to the old_cs variable should still be synchronized with
regular migrations that are also under cgroup_mutex.

> When concurrent cpuset attach operations are in progress,
> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
> causing an incorrect result. To avoid this problem while still allowing
> a certain level of parallelism, drop cpuset_attach_old_cs and use a
> per-cpuset attach_old_cs value. Also allow at most one active
> attach operation per cpuset to avoid corrupting the per-cpuset
> attach_old_cs value.

Secondly, semantically wouldn't a `void *ss_priv[CGROUP_SUBSYS_COUNT]`
in struct cgroup_taskset make it simpler wrt the exclusivity guarantees?

Thirdly, if my initial assumption is right -- I'd suggest ordering this
before the patch `cgroup/cpuset: Make cpuset_fork() handle
CLONE_INTO_CGROUP properly` to spare backporters possible troubles if
this would be a fixup to that.

Thanks,
Michal
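As a sketch of the ss_priv idea above (stub types and values are invented
here, not the real kernel definitions): each migration operation would
carry its own per-controller slot, so concurrent operations could not
clobber each other's stashed old cpuset even without an
attach_in_progress limit.

#define CGROUP_SUBSYS_COUNT	8	/* hypothetical value */
#define cpuset_cgrp_id		0	/* hypothetical controller index */

struct cgroup_taskset_stub {
	void *ss_priv[CGROUP_SUBSYS_COUNT];	/* one slot per controller */
};

struct cpuset_stub { int id; };

static int cpuset_can_attach_stub(struct cgroup_taskset_stub *tset,
				  struct cpuset_stub *oldcs)
{
	tset->ss_priv[cpuset_cgrp_id] = oldcs;	/* stash in this operation */
	return 0;
}

static struct cpuset_stub *cpuset_attach_stub(struct cgroup_taskset_stub *tset)
{
	return tset->ss_priv[cpuset_cgrp_id];	/* read back, per operation */
}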
On 4/3/23 12:47, Michal Koutný wrote:
> On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long <longman@redhat.com> wrote:
>> The current cpuset code uses the global cpuset_attach_old_cs variable
>> to store the old cpuset value between consecutive cpuset_can_attach()
>> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
>> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
>> attach operations are possible.
> Do I understand correctly that this is a consequence of
> cpuset_attach_task() on the clone path?
> In that particular case (with CLONE_INTO_CGROUP) cgroup_mutex is taken,
> so the access to the old_cs variable should still be synchronized with
> regular migrations that are also under cgroup_mutex.

This patch is actually not related to the CLONE_INTO_CGROUP problem in
patch 1. It is a generic problem when multiple users are moving threads
into cgroup.threads of the same or different cpusets simultaneously.

>> When concurrent cpuset attach operations are in progress,
>> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
>> causing an incorrect result. To avoid this problem while still allowing
>> a certain level of parallelism, drop cpuset_attach_old_cs and use a
>> per-cpuset attach_old_cs value. Also allow at most one active
>> attach operation per cpuset to avoid corrupting the per-cpuset
>> attach_old_cs value.
> Secondly, semantically wouldn't a `void *ss_priv[CGROUP_SUBSYS_COUNT]`
> in struct cgroup_taskset make it simpler wrt the exclusivity guarantees?

I guess we can put the old_cs value into cgroup_taskset. Since the
related attach_in_progress value is in cpuset, I put the old_cs there
too.

> Thirdly, if my initial assumption is right -- I'd suggest ordering this
> before the patch `cgroup/cpuset: Make cpuset_fork() handle
> CLONE_INTO_CGROUP properly` to spare backporters possible troubles if
> this would be a fixup to that.

I don't believe this patch has a dependency on patch 1. I had thought
about adding a Fixes tag so it will be backported to stable. However,
this problem should be rare. Let's see what others think about it.

Cheers,
Longman
Hi.

On Mon, Apr 03, 2023 at 01:41:33PM -0400, Waiman Long <longman@redhat.com> wrote:
> This patch is actually not related to the CLONE_INTO_CGROUP problem in
> patch 1. It is a generic problem when multiple users are moving threads
> into cgroup.threads of the same or different cpusets simultaneously.

I meant this:

  __cgroup_procs_write
    cgroup_kn_lock_live
      mutex_lock(&cgroup_mutex)

and (more succinctly)

  cgroup_update_dfl_csses
    lockdep_assert_held(&cgroup_mutex)

Even the threaded migrations should be synchronized here.
Can you please explain in more detail what's the problematic case?

> I don't believe this patch has a dependency on patch 1.

(I meant the opposite, patch 1 would depend on this 3/3. But maybe this
one isn't needed.)

Michal
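To make the locking argument concrete, a userspace sketch (invented
names, a pthread mutex standing in for cgroup_mutex): when both the
prepare and commit phases of every migration run under the same lock,
a second operation cannot interleave between them, so even a single
global stash stays consistent.

#include <pthread.h>

static pthread_mutex_t cgroup_mutex_stub = PTHREAD_MUTEX_INITIALIZER;
static void *attach_old_cs_stub;
static void *seen;

static void migrate(void *oldcs)
{
	/* __cgroup_procs_write analogue: one lock spans both phases */
	pthread_mutex_lock(&cgroup_mutex_stub);
	attach_old_cs_stub = oldcs;	/* can_attach(): prepare */
	seen = attach_old_cs_stub;	/* cpuset_attach(): commit, same section */
	pthread_mutex_unlock(&cgroup_mutex_stub);
}

int main(void)
{
	int old = 1;

	migrate(&old);		/* concurrent callers block until both phases end */
	return seen != &old;	/* always 0: the stash cannot be clobbered */
}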
On 4/4/23 05:07, Michal Koutný wrote:
> Hi.
>
> On Mon, Apr 03, 2023 at 01:41:33PM -0400, Waiman Long <longman@redhat.com> wrote:
>> This patch is actually not related to the CLONE_INTO_CGROUP problem in
>> patch 1. It is a generic problem when multiple users are moving threads
>> into cgroup.threads of the same or different cpusets simultaneously.
> I meant this:
>
>   __cgroup_procs_write
>     cgroup_kn_lock_live
>       mutex_lock(&cgroup_mutex)
>
> and (more succinctly)
>
>   cgroup_update_dfl_csses
>     lockdep_assert_held(&cgroup_mutex)
>
> Even the threaded migrations should be synchronized here.
> Can you please explain in more detail what's the problematic case?

You are right. I missed the cgroup_mutex synchronization here. So this
patch isn't needed. I will drop it in the next version.

Cheers,
Longman