The current cpuset code uses the global cpuset_attach_old_cs variable
to store the old cpuset value between consecutive cpuset_can_attach()
and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
attach operations are possible.
When concurrent cpuset attach operations are in progress,
cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
causing an incorrect result. To avoid this problem while still allowing
a certain level of parallelism, drop cpuset_attach_old_cs and use a
per-cpuset attach_old_cs value. Also allow at most one active
attach operation per cpuset to avoid corrupting the per-cpuset
attach_old_cs value.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2367de611c42..3f925c261513 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -198,6 +198,8 @@ struct cpuset {
/* Handle for cpuset.cpus.partition */
struct cgroup_file partition_file;
+
+ struct cpuset *attach_old_cs;
};
/*
@@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}
-static struct cpuset *cpuset_attach_old_cs;
-
/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
- struct cpuset *cs;
+ struct cpuset *cs, *oldcs;
struct task_struct *task;
int ret;
/* used later by cpuset_attach() */
- cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
+ oldcs = task_cs(cgroup_taskset_first(tset, &css));
cs = css_cs(css);
percpu_down_write(&cpuset_rwsem);
+ /*
+ * Only one cpuset attach operation is allowed for each cpuset.
+ */
+ ret = -EBUSY;
+ if (cs->attach_in_progress)
+ goto out_unlock;
+
/* allow moving tasks into an empty cpuset if on default hierarchy */
ret = -ENOSPC;
if (!is_in_v2_mode() &&
@@ -2498,6 +2505,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* changes which zero cpus/mems_allowed.
*/
cs->attach_in_progress++;
+ cs->attach_old_cs = oldcs;
ret = 0;
out_unlock:
percpu_up_write(&cpuset_rwsem);
@@ -2548,7 +2556,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct task_struct *leader;
struct cgroup_subsys_state *css;
struct cpuset *cs;
- struct cpuset *oldcs = cpuset_attach_old_cs;
+ struct cpuset *oldcs;
bool cpus_updated, mems_updated;
cgroup_taskset_first(tset, &css);
@@ -2556,6 +2564,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
percpu_down_write(&cpuset_rwsem);
+ oldcs = cs->attach_old_cs;
cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
--
2.31.1
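For illustration, a minimal userspace sketch of the lost-update pattern
the commit message describes (hypothetical stub names, not kernel code):
one operation's prepare phase overwrites the shared global before another
operation's commit phase reads it back.

#include <pthread.h>
#include <stdio.h>

struct cpuset_stub { int id; };

static struct cpuset_stub set_a = { 1 }, set_b = { 2 };
static struct cpuset_stub *attach_old_cs;	/* shared by all operations */

static void can_attach(struct cpuset_stub *oldcs)
{
	attach_old_cs = oldcs;		/* prepare phase records the old value */
}

static struct cpuset_stub *attach(void)
{
	return attach_old_cs;		/* commit phase reads it back */
}

static void *op_b(void *arg)
{
	(void)arg;
	can_attach(&set_b);		/* a second operation's prepare phase */
	return NULL;
}

int main(void)
{
	pthread_t t;

	can_attach(&set_a);		/* operation A prepares with set 1 */
	pthread_create(&t, NULL, op_b, NULL);
	pthread_join(t, NULL);		/* B's prepare lands before A's commit */

	/* A commits and reads B's value: id 2 instead of the expected 1 */
	printf("op A sees old cpuset id %d\n", attach()->id);
	return 0;
}

Built with something like `cc -pthread race.c`, operation A prints id 2
rather than its own old cpuset, which is the corruption the per-cpuset
attach_old_cs field is meant to avoid.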
On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long wrote:
> The current cpuset code uses the global cpuset_attach_old_cs variable
> to store the old cpuset value between consecutive cpuset_can_attach()
> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
> attach operations are possible.
>
> When concurrent cpuset attach operations are in progress,
> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
> causing an incorrect result. To avoid this problem while still allowing
> a certain level of parallelism, drop cpuset_attach_old_cs and use a
> per-cpuset attach_old_cs value. Also allow at most one active
> attach operation per cpuset to avoid corrupting the per-cpuset
> attach_old_cs value.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 2367de611c42..3f925c261513 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -198,6 +198,8 @@ struct cpuset {
>
> /* Handle for cpuset.cpus.partition */
> struct cgroup_file partition_file;
> +
> + struct cpuset *attach_old_cs;
> };
>
> /*
> @@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
> return val;
> }
>
> -static struct cpuset *cpuset_attach_old_cs;
> -
> /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
> static int cpuset_can_attach(struct cgroup_taskset *tset)
> {
> struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> + struct cpuset *cs, *oldcs;
> struct task_struct *task;
> int ret;
>
> /* used later by cpuset_attach() */
> - cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> + oldcs = task_cs(cgroup_taskset_first(tset, &css));
> cs = css_cs(css);
>
> percpu_down_write(&cpuset_rwsem);
>
> + /*
> + * Only one cpuset attach operation is allowed for each cpuset.
> + */
> + ret = -EBUSY;
> + if (cs->attach_in_progress)
> + goto out_unlock;
That'll mean CLONE_INTO_CGROUP becomes even more interesting because it
isn't subject to this restriction in contrast to fork()+migrate, right?
On 4/6/23 05:44, Christian Brauner wrote:
> On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long wrote:
>> The current cpuset code uses the global cpuset_attach_old_cs variable
>> to store the old cpuset value between consecutive cpuset_can_attach()
>> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
>> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
>> attach operations are possible.
>>
>> When concurrent cpuset attach operations are in progress,
>> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
>> causing an incorrect result. To avoid this problem while still allowing
>> a certain level of parallelism, drop cpuset_attach_old_cs and use a
>> per-cpuset attach_old_cs value. Also allow at most one active
>> attach operation per cpuset to avoid corrupting the per-cpuset
>> attach_old_cs value.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
>> 1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 2367de611c42..3f925c261513 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -198,6 +198,8 @@ struct cpuset {
>>
>> /* Handle for cpuset.cpus.partition */
>> struct cgroup_file partition_file;
>> +
>> + struct cpuset *attach_old_cs;
>> };
>>
>> /*
>> @@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
>> return val;
>> }
>>
>> -static struct cpuset *cpuset_attach_old_cs;
>> -
>> /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
>> static int cpuset_can_attach(struct cgroup_taskset *tset)
>> {
>> struct cgroup_subsys_state *css;
>> - struct cpuset *cs;
>> + struct cpuset *cs, *oldcs;
>> struct task_struct *task;
>> int ret;
>>
>> /* used later by cpuset_attach() */
>> - cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>> + oldcs = task_cs(cgroup_taskset_first(tset, &css));
>> cs = css_cs(css);
>>
>> percpu_down_write(&cpuset_rwsem);
>>
>> + /*
>> + * Only one cpuset attach operation is allowed for each cpuset.
>> + */
>> + ret = -EBUSY;
>> + if (cs->attach_in_progress)
>> + goto out_unlock;
> That'll mean CLONE_INTO_CGROUP becomes even more interesting because it
> isn't subject to this restriction in contrast to fork()+migrate, right?
This patch is not related to CLONE_INTO_CGROUP. In fact, I have
dropped that in a later version.
Cheers,
Longman
On Fri, Apr 07, 2023 at 11:30:14AM -0400, Waiman Long wrote:
> On 4/6/23 05:44, Christian Brauner wrote:
> > On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long wrote:
> > > The current cpuset code uses the global cpuset_attach_old_cs variable
> > > to store the old cpuset value between consecutive cpuset_can_attach()
> > > and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
> > > not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
> > > attach operations are possible.
> > >
> > > When concurrent cpuset attach operations are in progress,
> > > cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
> > > causing an incorrect result. To avoid this problem while still allowing
> > > a certain level of parallelism, drop cpuset_attach_old_cs and use a
> > > per-cpuset attach_old_cs value. Also allow at most one active
> > > attach operation per cpuset to avoid corrupting the per-cpuset
> > > attach_old_cs value.
> > >
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > ---
> > > kernel/cgroup/cpuset.c | 19 ++++++++++++++-----
> > > 1 file changed, 14 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > index 2367de611c42..3f925c261513 100644
> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -198,6 +198,8 @@ struct cpuset {
> > > /* Handle for cpuset.cpus.partition */
> > > struct cgroup_file partition_file;
> > > +
> > > + struct cpuset *attach_old_cs;
> > > };
> > > /*
> > > @@ -2456,22 +2458,27 @@ static int fmeter_getrate(struct fmeter *fmp)
> > > return val;
> > > }
> > > -static struct cpuset *cpuset_attach_old_cs;
> > > -
> > > /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
> > > static int cpuset_can_attach(struct cgroup_taskset *tset)
> > > {
> > > struct cgroup_subsys_state *css;
> > > - struct cpuset *cs;
> > > + struct cpuset *cs, *oldcs;
> > > struct task_struct *task;
> > > int ret;
> > > /* used later by cpuset_attach() */
> > > - cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> > > + oldcs = task_cs(cgroup_taskset_first(tset, &css));
> > > cs = css_cs(css);
> > > percpu_down_write(&cpuset_rwsem);
> > > + /*
> > > + * Only one cpuset attach operation is allowed for each cpuset.
> > > + */
> > > + ret = -EBUSY;
> > > + if (cs->attach_in_progress)
> > > + goto out_unlock;
> > That'll mean CLONE_INTO_CGROUP becomes even more interesting because it
> > isn't subject to this restriction in contrast to fork()+migrate, right?
>
> This patch is not related to CLONE_INTO_CGROUP. In fact, I have dropped
I know that. My point had been that this patch introduced a new
restriction affecting parallelism in the migration path while the
CLONE_INTO_CGROUP path remained unaffected.
On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long <longman@redhat.com> wrote:
> The current cpuset code uses the global cpuset_attach_old_cs variable
> to store the old cpuset value between consecutive cpuset_can_attach()
> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
> attach operations are possible.

Do I understand correctly that this is a consequence of
cpuset_attach_task() on the clone path?
In that particular case (with CLONE_INTO_CGROUP) cgroup_mutex is taken,
so the access to the old_cs variable should still be synchronized with
regular migrations that are also under cgroup_mutex.

> When concurrent cpuset attach operations are in progress,
> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
> causing an incorrect result. To avoid this problem while still allowing
> a certain level of parallelism, drop cpuset_attach_old_cs and use a
> per-cpuset attach_old_cs value. Also allow at most one active
> attach operation per cpuset to avoid corrupting the per-cpuset
> attach_old_cs value.

Secondly, semantically wouldn't a `void *ss_priv[CGROUP_SUBSYS_COUNT]`
in struct cgroup_taskset make it simpler wrt the exclusivity guarantees?

Thirdly, if my initial assumption is right -- I'd suggest ordering this
before the patch `cgroup/cpuset: Make cpuset_fork() handle
CLONE_INTO_CGROUP properly` to spare backporters possible troubles if
this would be a fixup to that.

Thanks,
Michal
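As a sketch of the ss_priv idea above (stub types and values are invented
here, not the real kernel definitions): each migration operation would
carry its own per-controller slot, so concurrent operations could not
clobber each other's stashed old cpuset even without an
attach_in_progress limit.

#define CGROUP_SUBSYS_COUNT	8	/* hypothetical value */
#define cpuset_cgrp_id		0	/* hypothetical controller index */

struct cgroup_taskset_stub {
	void *ss_priv[CGROUP_SUBSYS_COUNT];	/* one slot per controller */
};

struct cpuset_stub { int id; };

static int cpuset_can_attach_stub(struct cgroup_taskset_stub *tset,
				  struct cpuset_stub *oldcs)
{
	tset->ss_priv[cpuset_cgrp_id] = oldcs;	/* stash in this operation */
	return 0;
}

static struct cpuset_stub *cpuset_attach_stub(struct cgroup_taskset_stub *tset)
{
	return tset->ss_priv[cpuset_cgrp_id];	/* read back, per operation */
}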
On 4/3/23 12:47, Michal Koutný wrote:
> On Fri, Mar 31, 2023 at 10:50:45AM -0400, Waiman Long <longman@redhat.com> wrote:
>> The current cpuset code uses the global cpuset_attach_old_cs variable
>> to store the old cpuset value between consecutive cpuset_can_attach()
>> and cpuset_attach() calls. Since a caller of cpuset_can_attach() may
>> not need to hold the global cgroup_threadgroup_rwsem, parallel cpuset
>> attach operations are possible.
> Do I understand correctly that this is a consequence of
> cpuset_attach_task() on the clone path?
> In that particular case (with CLONE_INTO_CGROUP) cgroup_mutex is taken,
> so the access to the old_cs variable should still be synchronized with
> regular migrations that are also under cgroup_mutex.

This patch is actually not related to the CLONE_INTO_CGROUP problem in
patch 1. It is a generic problem when multiple users are moving threads
into cgroup.threads of the same or different cpusets simultaneously.

>> When concurrent cpuset attach operations are in progress,
>> cpuset_attach() may fetch the wrong value from cpuset_attach_old_cs,
>> causing an incorrect result. To avoid this problem while still allowing
>> a certain level of parallelism, drop cpuset_attach_old_cs and use a
>> per-cpuset attach_old_cs value. Also allow at most one active
>> attach operation per cpuset to avoid corrupting the per-cpuset
>> attach_old_cs value.
> Secondly, semantically wouldn't a `void *ss_priv[CGROUP_SUBSYS_COUNT]`
> in struct cgroup_taskset make it simpler wrt the exclusivity guarantees?

I guess we can put the old_cs value into cgroup_taskset. Since the
related attach_in_progress value is in cpuset, I put the old_cs there
too.

> Thirdly, if my initial assumption is right -- I'd suggest ordering this
> before the patch `cgroup/cpuset: Make cpuset_fork() handle
> CLONE_INTO_CGROUP properly` to spare backporters possible troubles if
> this would be a fixup to that.

I don't believe this patch has a dependency on patch 1. I had thought
about adding a Fixes tag so it will be backported to stable. However,
this problem should be rare. Let's see what others think about it.

Cheers,
Longman
Hi.

On Mon, Apr 03, 2023 at 01:41:33PM -0400, Waiman Long <longman@redhat.com> wrote:
> This patch is actually not related to the CLONE_INTO_CGROUP problem in
> patch 1. It is a generic problem when multiple users are moving threads
> into cgroup.threads of the same or different cpusets simultaneously.

I meant this:

  __cgroup_procs_write
    cgroup_kn_lock_live
      mutex_lock(&cgroup_mutex)

and (more succinctly)

  cgroup_update_dfl_csses
    lockdep_assert_held(&cgroup_mutex)

Even the threaded migrations should be synchronized here.
Can you please explain in more detail what's the problematic case?

> I don't believe this patch has a dependency on patch 1.

(I meant the opposite, patch 1 would depend on this 3/3. But maybe this
one isn't needed.)

Michal
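To make the locking argument concrete, a userspace sketch (invented
names, a pthread mutex standing in for cgroup_mutex): when both the
prepare and commit phases of every migration run under the same lock,
a second operation cannot interleave between them, so even a single
global stash stays consistent.

#include <pthread.h>

static pthread_mutex_t cgroup_mutex_stub = PTHREAD_MUTEX_INITIALIZER;
static void *attach_old_cs_stub;
static void *seen;

static void migrate(void *oldcs)
{
	/* __cgroup_procs_write analogue: one lock spans both phases */
	pthread_mutex_lock(&cgroup_mutex_stub);
	attach_old_cs_stub = oldcs;	/* can_attach(): prepare */
	seen = attach_old_cs_stub;	/* cpuset_attach(): commit, same section */
	pthread_mutex_unlock(&cgroup_mutex_stub);
}

int main(void)
{
	int old = 1;

	migrate(&old);		/* concurrent callers block until both phases end */
	return seen != &old;	/* always 0: the stash cannot be clobbered */
}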
On 4/4/23 05:07, Michal Koutný wrote:
> Hi.
>
> On Mon, Apr 03, 2023 at 01:41:33PM -0400, Waiman Long <longman@redhat.com> wrote:
>> This patch is actually not related to the CLONE_INTO_CGROUP problem in
>> patch 1. It is a generic problem when multiple users are moving threads
>> into cgroup.threads of the same or different cpusets simultaneously.
> I meant this:
>
>   __cgroup_procs_write
>     cgroup_kn_lock_live
>       mutex_lock(&cgroup_mutex)
>
> and (more succinctly)
>
>   cgroup_update_dfl_csses
>     lockdep_assert_held(&cgroup_mutex)
>
> Even the threaded migrations should be synchronized here.
> Can you please explain in more detail what's the problematic case?

You are right. I missed the cgroup_mutex synchronization here. So this
patch isn't needed. I will drop it in the next version.

Cheers,
Longman