From: Chen Ridong <chenridong@huawei.com>
Currently, the check for whether a partition is populated does not
account for tasks in the process of attaching. This is a corner case
that can leave a task stuck in a partition with no effective CPUs.
The race condition occurs as follows:

  cpu0                               cpu1
                                     // cpuset A with cpu N
  migrate task p to A
  cpuset_can_attach
    // with effective cpus
    // check ok
  // cpuset_mutex is not held        // clear cpuset.cpus.exclusive
                                     update_exclusive_cpumask
                                       // tasks_nocpu_error check ok
                                       // empty effective cpus,
                                       // partition valid
  cpuset_attach
  ...
  // task p stays in A, with no effective cpus.
To fix this issue, introduce cs_is_populated(), which also accounts for
tasks in the process of attaching, and use this new helper in
validate_change() and partition_is_populated().
Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/cgroup/cpuset.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index daf813386260..3ffa0c02a020 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -356,6 +356,13 @@ static inline bool is_in_v2_mode(void)
(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
}
+static inline bool cs_is_populated(struct cpuset *cs)
+{
+ /* Tasks in the process of attaching should be considered as populated */
+ return cgroup_is_populated(cs->css.cgroup) ||
+ cs->attach_in_progress;
+}
+
/**
* partition_is_populated - check if partition has tasks
* @cs: partition root to be checked
@@ -376,7 +383,8 @@ static inline bool partition_is_populated(struct cpuset *cs,
struct cgroup_subsys_state *css;
struct cpuset *child;
- if (cs->css.cgroup->nr_populated_csets)
+ if (cs->css.cgroup->nr_populated_csets ||
+ cs->attach_in_progress)
return true;
rcu_read_lock();
@@ -385,7 +393,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
continue;
if (is_partition_valid(child))
continue;
- if (cgroup_is_populated(child->css.cgroup)) {
+ if (cs_is_populated(child)) {
rcu_read_unlock();
return true;
}
@@ -670,7 +678,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
* be changed to have empty cpus_allowed or mems_allowed.
*/
ret = -ENOSPC;
- if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) {
+ if (cs_is_populated(cur)) {
if (!cpumask_empty(cur->cpus_allowed) &&
cpumask_empty(trial->cpus_allowed))
goto out;
--
2.34.1
Hi Ridong.
On Tue, Nov 11, 2025 at 01:26:32PM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
...
> +static inline bool cs_is_populated(struct cpuset *cs)
> +{
> + /* Tasks in the process of attaching should be considered as populated */
> + return cgroup_is_populated(cs->css.cgroup) ||
> + cs->attach_in_progress;
> +}
s/process/cpuset/ in the subject
and
s/Tasks/Cpusets/ in the comment above
Also, should there be some lockdep_assert_held() in this helper (for
documentation purposes but also for correctly synchronized validity of
the returned value.)
Thanks,
Michal
On 11/11/25 9:01 AM, Michal Koutný wrote:
> Hi Ridong.
>
> On Tue, Nov 11, 2025 at 01:26:32PM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> ...
>> +static inline bool cs_is_populated(struct cpuset *cs)
>> +{
>> + /* Tasks in the process of attaching should be considered as populated */
>> + return cgroup_is_populated(cs->css.cgroup) ||
>> + cs->attach_in_progress;
>> +}
> s/process/cpuset/ in the subject
> and
> s/Tasks/Cpusets/ in the comment above
Agreed.
> Also, should there be some lockdep_assert_held() in this helper (for
> documentation purposes but also for correctly synchronized validity of
> the returned value.)
A lockdep_assert_held() is certainly needed if it is an externally
visible helper that can be called outside cpuset. For an internal
helper like this one, we may not really need that, as almost all the
code in cpuset.c is within either a cpuset_mutex or callback_lock
critical section. So I am fine with or without it.
Cheers,
Longman
On 2025/11/11 23:16, Waiman Long wrote:
> On 11/11/25 9:01 AM, Michal Koutný wrote:
>> Hi Ridong.
>>
>> On Tue, Nov 11, 2025 at 01:26:32PM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> ...
>>> +static inline bool cs_is_populated(struct cpuset *cs)
>>> +{
>>> + /* Tasks in the process of attaching should be considered as populated */
>>> + return cgroup_is_populated(cs->css.cgroup) ||
>>> + cs->attach_in_progress;
>>> +}
>> s/process/cpuset/ in the subject
>> and
>> s/Tasks/Cpusets/ in the comment above
> Agreed.
Thanks, will update.
--
Best regards,
Ridong
On Tue, Nov 11, 2025 at 10:16:33AM -0500, Waiman Long <llong@redhat.com> wrote:
> For internal helper like this one, we may not really need that as
> almost all the code in cpuset.c are within either a cpuset_mutex or
> callback_lock critical sections. So I am fine with or without it.

OK, cpuset_mutex and callback_lock are close but cgroup_is_populated()
that caught my eye would also need cgroup_mutex otherwise "the result
can only be used as a hint" (quote from cgroup.h).

Or is it safe to assume that cpuset_mutex inside cpuset_attach() is
sufficient to always (incl. exits) ensure stability of
cgroup_is_populated() result?

Anyway, I'd find some clarifications in the commit message or the
surrounding code about this helpful. (Judgment call, whether with a
lockdep macro. My opinion is -- why not.)

Thanks,
Michal
On 11/11/25 2:25 PM, Michal Koutný wrote:
> On Tue, Nov 11, 2025 at 10:16:33AM -0500, Waiman Long <llong@redhat.com> wrote:
>> For internal helper like this one, we may not really need that as
>> almost all the code in cpuset.c are within either a cpuset_mutex or
>> callback_lock critical sections. So I am fine with or without it.
> OK, cpuset_mutex and callback_lock are close but cgroup_is_populated()
> that caught my eye would also need cgroup_mutex otherwise "the result
> can only be used as a hint" (quote from cgroup.h).
>
> Or is it safe to assume that cpuset_mutex inside cpuset_attach() is
> sufficient to always (incl. exits) ensure stability of
> cgroup_is_populated() result?
>
> Anyway, I'd find some clarifications in the commit message or the
> surrounding code about this helpful. (Judgment call, whether with a
> lockdep macro. My opinion is -- why not.)

For attach_in_progress, it is protected by the cpuset_mutex. So it may
make sense to add a lockdep_assert_held() for that.

You are right that there are problems WRT the stability of the
cgroup_is_populated() value.

I think "cgrp->nr_populated_csets + cs->attach_in_progress" should be
almost stable for the cgroup itself with cpuset_mutex, but there can be
a small timing window after cpuset_attach(), but before the stat is
updated, where the sum is 0 but there are actually tasks in the cgroup.

For "cgrp->nr_populated_domain_children +
cgrp->nr_populated_threaded_children", it also has the problem that the
sum can be 0 while attach_in_progress is set in one or more of the
child cgroups. So even with this patch, we can't guarantee 100% that
there can be no task in the partition even if it has empty
effective_cpus. It is only a problem for nested local partitions,
though, as remote partitions are not allowed to exhaust all the CPUs
from the root cgroup.

We should probably document that limitation to warn users if they try
to create nested local partitions where the parent partition root of
the child partitions has empty effective_cpus.

Cheers,
Longman
On 2025/11/12 4:35, Waiman Long wrote:
> On 11/11/25 2:25 PM, Michal Koutný wrote:
>> On Tue, Nov 11, 2025 at 10:16:33AM -0500, Waiman Long <llong@redhat.com> wrote:
>>> For internal helper like this one, we may not really need that as
>>> almost all the code in cpuset.c are within either a cpuset_mutex or
>>> callback_lock critical sections. So I am fine with or without it.
>> OK, cpuset_mutex and callback_lock are close but cgroup_is_populated()
>> that caught my eye would also need cgroup_mutex otherwise "the result
>> can only be used as a hint" (quote from cgroup.h).
>>
>> Or is it safe to assume that cpuset_mutex inside cpuset_attach() is
>> sufficient to always (incl. exits) ensure stability of
>> cgroup_is_populated() result?
>>
>> Anyway, I'd find some clarifications in the commit message or the
>> surrounding code about this helpful. (Judgment call, whether with a
>> lockdep macro. My opinion is -- why not.)
>
> For attach_in_progress, it is protected by the cpuset_mutex. So it may make sense to add a
> lockdep_assert_held() for that.
>
Will add.
> You are right that there are problems WRT the stability of cgroup_is_populated() value.
>
> I think "cgrp->nr_populated_csets + cs->attach_in_progress" should be almost stable for the cgroup
> itself with cpuset_mutex, but there can be a small timing window after cpuset_attach(), but before
> the stat is updated where the sum is 0, but there are actually tasks in the cgroup.
>
Do you mean there’s a small window after ss->attach (i.e., cpuset_attach) where
cgrp->nr_populated_csets + cs->attach_in_progress could be 0?
If I understand correctly:
ss->can_attach: cs->attach_in_progress++, sum > 0
css_set_move_task->css_set_update_populated: cgrp->nr_populated_csets++, sum > 0
ss->attach->cpuset_attach: cs->attach_in_progress--, sum > 0
What exactly is the small window you’re referring to where the sum equals 0?
> For "cgrp->nr_populated_domain_children + cgrp->nr_populated_threaded_children", it also has the
> problem that the sum can be 0 but there are attach_in_progress set in one or more of the child
> cgroups. So even with this patch, we can't guarantee 100% that there can be no task in the partition
> even if it has empty effective_cpus. It is only a problem for nested local partitions though as
> remote partitions are not allowed to exhaust all the CPUs from root cgroup.
>
> We should probably document that limitation to warn users if they try to create nested local
> partitions where the parent partition root of the child partitions has empty effective_cpus.
>
Hmm, but it was what the commit e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty
cpuset.cpus.effective") allowed, and it makes sense.
--
Best regards,
Ridong
On 11/11/25 8:58 PM, Chen Ridong wrote:
>
> On 2025/11/12 4:35, Waiman Long wrote:
>> On 11/11/25 2:25 PM, Michal Koutný wrote:
>>> On Tue, Nov 11, 2025 at 10:16:33AM -0500, Waiman Long <llong@redhat.com> wrote:
>>>> For internal helper like this one, we may not really need that as
>>>> almost all the code in cpuset.c are within either a cpuset_mutex or
>>>> callback_lock critical sections. So I am fine with or without it.
>>> OK, cpuset_mutex and callback_lock are close but cgroup_is_populated()
>>> that caught my eye would also need cgroup_mutex otherwise "the result
>>> can only be used as a hint" (quote from cgroup.h).
>>>
>>> Or is it safe to assume that cpuset_mutex inside cpuset_attach() is
>>> sufficient to always (incl. exits) ensure stability of
>>> cgroup_is_populated() result?
>>>
>>> Anyway, I'd find some clarifications in the commit message or the
>>> surrounding code about this helpful. (Judgment call, whether with a
>>> lockdep macro. My opinion is -- why not.)
>> For attach_in_progress, it is protected by the cpuset_mutex. So it may make sense to add a
>> lockdep_assert_held() for that.
>>
> Will add.
>
>> You are right that there are problems WRT the stability of cgroup_is_populated() value.
>>
>> I think "cgrp->nr_populated_csets + cs->attach_in_progress" should be almost stable for the cgroup
>> itself with cpuset_mutex, but there can be a small timing window after cpuset_attach(), but before
>> the stat is updated where the sum is 0, but there are actually tasks in the cgroup.
>>
> Do you mean there’s a small window after ss->attach (i.e., cpuset_attach) where
> cgrp->nr_populated_csets + cs->attach_in_progress could be 0?
>
> If I understand correctly:
>
> ss->can_attach: cs->attach_in_progress++, sum > 0
> css_set_move_task->css_set_update_populated: cgrp->nr_populated_csets++, sum > 0
> ss->attach->cpuset_attach: cs->attach_in_progress--, sum > 0
>
> What exactly is the small window you’re referring to where the sum equals 0?
Yes, the nr_populated_csets is incremented before calling
cpuset_attach(). I think I had mixed up the ordering. Thanks for the
clarification.
>
>> For "cgrp->nr_populated_domain_children + cgrp->nr_populated_threaded_children", it also has the
>> problem that the sum can be 0 but there are attach_in_progress set in one or more of the child
>> cgroups. So even with this patch, we can't guarantee 100% that there can be no task in the partition
>> even if it has empty effective_cpus. It is only a problem for nested local partitions though as
>> remote partitions are not allowed to exhaust all the CPUs from root cgroup.
>>
>> We should probably document that limitation to warn users if they try to create nested local
>> partitions where the parent partition root of the child partitions has empty effective_cpus.
>>
> Hmm, but it was what the commit e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty
> cpuset.cpus.effective") allowed, and it makes sense.
This commit allows users to not waste a CPU when setting up nested
local partitions where there is no task at some intermediate levels.
One way to address this gap is to iterate down the cpuset hierarchy to
check whether any descendant's attach_in_progress count is set. So it
is a solvable problem.
There is also a flip side where cgroup_is_populated() returns a
non-zero value, but the tasks are actively being moved away. This false
positive is less of a problem even though it can cause a failure in
distributing out all the CPUs to child partitions. Users can retry and
it should work the second time.
Cheers,
Longman
On 2025/11/12 10:21, Waiman Long wrote:
> On 11/11/25 8:58 PM, Chen Ridong wrote:
>>
>> On 2025/11/12 4:35, Waiman Long wrote:
>>> On 11/11/25 2:25 PM, Michal Koutný wrote:
>>>> On Tue, Nov 11, 2025 at 10:16:33AM -0500, Waiman Long <llong@redhat.com> wrote:
>>>>> For internal helper like this one, we may not really need that as
>>>>> almost all the code in cpuset.c are within either a cpuset_mutex or
>>>>> callback_lock critical sections. So I am fine with or without it.
>>>> OK, cpuset_mutex and callback_lock are close but cgroup_is_populated()
>>>> that caught my eye would also need cgroup_mutex otherwise "the result
>>>> can only be used as a hint" (quote from cgroup.h).
>>>>
>>>> Or is it safe to assume that cpuset_mutex inside cpuset_attach() is
>>>> sufficient to always (incl. exits) ensure stability of
>>>> cgroup_is_populated() result?
>>>>
>>>> Anyway, I'd find some clarifications in the commit message or the
>>>> surrounding code about this helpful. (Judgment call, whether with a
>>>> lockdep macro. My opinion is -- why not.)
>>> For attach_in_progress, it is protected by the cpuset_mutex. So it may make sense to add a
>>> lockdep_assert_held() for that.
>>>
>> Will add.
>>
>>> You are right that there are problems WRT the stability of cgroup_is_populated() value.
>>>
>>> I think "cgrp->nr_populated_csets + cs->attach_in_progress" should be almost stable for the cgroup
>>> itself with cpuset_mutex, but there can be a small timing window after cpuset_attach(), but before
>>> the stat is updated where the sum is 0, but there are actually tasks in the cgroup.
>>>
>> Do you mean there’s a small window after ss->attach (i.e., cpuset_attach) where
>> cgrp->nr_populated_csets + cs->attach_in_progress could be 0?
>>
>> If I understand correctly:
>>
>> ss->can_attach: cs->attach_in_progress++, sum > 0
>> css_set_move_task->css_set_update_populated: cgrp->nr_populated_csets++, sum > 0
>> ss->attach->cpuset_attach: cs->attach_in_progress--, sum > 0
>>
>> What exactly is the small window you’re referring to where the sum equals 0?
> Yes, the nr_populated_csets is incremented before calling cpuset_attach(). I think I had mixed up
> the ordering. Thanks for the clarification.
>>
>>> For "cgrp->nr_populated_domain_children + cgrp->nr_populated_threaded_children", it also has the
>>> problem that the sum can be 0 but there are attach_in_progress set in one or more of the child
>>> cgroups. So even with this patch, we can't guarantee 100% that there can be no task in the partition
>>> even if it has empty effective_cpus. It is only a problem for nested local partitions though as
>>> remote partitions are not allowed to exhaust all the CPUs from root cgroup.
>>>
>>> We should probably document that limitation to warn users if they try to create nested local
>>> partitions where the parent partition root of the child partitions has empty effective_cpus.
>>>
>> Hmm, but it was what the commit e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty
>> cpuset.cpus.effective") allowed, and it makes sense.
>
> This commit will allow user to not waste a CPU if users want to set up a nested local partitions
> where there is no task in some intermediate levels. One way to address this gap is to iterate down
> the cpuset hierarchy to check if any of its attach_in_progress count is set. So it is a solvable
> problem.
>
Yes, the modification needed is to replace cpuset_for_each_child with cpuset_for_each_descendant_pre
in partition_is_populated.
> There is also a flip side where cgroup_is_populated() returns a non-zero value, but the tasks are
> actively being moved away. This false positive is less a problem even though it can cause a failure
> in distributing out all the CPUs to child partitions. The users can retry and it should work the
> second time.
>
Agree. This is a short-lived state, and it’s difficult to determine if any tasks are in the
partition during this period. Users should retry.
--
Best regards,
Ridong