It was found that any change to the current cpuset hierarchy may reset
the cpumask of the tasks in the affected cpusets to the default cpuset
value even if those tasks have cpus affinity explicitly set by the users
before. That is especially easy to trigger under a cgroup v2 environment
where writing "+cpuset" to the root cgroup's cgroup.subtree_control
file will reset the cpus affinity of all the processes in the system.
That is problematic in a nohz_full environment where the tasks running
in the nohz_full CPUs usually have their cpus affinity explicitly set
and will behave incorrectly if cpus affinity changes.
Fix this problem by looking at user_cpus_ptr which will be set if
cpus affinity have been explicitly set before and use it to restrcit
the given cpumask unless there is no overlap. In that case, it will
fallback to the given one.
To handle possible racing with concurrent sched_setaffinity() call, the
user_cpus_ptr is rechecked again after a successful set_cpus_allowed_ptr()
call. If the user_cpus_ptr status changes, the operation is retried
again with the newly assigned user_cpus_ptr.
With that change in place, it was verified that tasks that have its
cpus affinity explicitly set will not be affected by changes made to
the v2 cgroup.subtree_control files.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 42 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 40 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58aadfda9b8b..a663848d0459 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,44 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
return ret;
}
+/*
+ * Preserve user provided cpumask (if set) as much as possible unless there
+ * is no overlap with the given mask.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+ const struct cpumask *mask)
+{
+ cpumask_var_t new_mask;
+ int ret;
+
+ if (!READ_ONCE(p->user_cpus_ptr)) {
+ ret = set_cpus_allowed_ptr(p, mask);
+ /*
+ * If user_cpus_ptr becomes set now, we are racing with
+ * a concurrent sched_setaffinity(). So use the newly
+ * set user_cpus_ptr and retry again.
+ *
+ * TODO: We cannot detect change in the cpumask pointed to
+ * by user_cpus_ptr. We will have to add a sequence number
+ * if such a race needs to be addressed.
+ */
+ if (ret || !READ_ONCE(p->user_cpus_ptr))
+ return ret;
+ }
+
+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ if (copy_user_cpus_mask(p, new_mask) &&
+ cpumask_and(new_mask, new_mask, mask))
+ ret = set_cpus_allowed_ptr(p, new_mask);
+ else
+ ret = set_cpus_allowed_ptr(p, mask);
+
+ free_cpumask_var(new_mask);
+ return ret;
+}
+
#ifdef CONFIG_SMP
/*
* Helper routine for generate_sched_domains().
@@ -1130,7 +1168,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
css_task_iter_start(&cs->css, 0, &it);
while ((task = css_task_iter_next(&it)))
- set_cpus_allowed_ptr(task, cs->effective_cpus);
+ cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
css_task_iter_end(&it);
}
@@ -2303,7 +2341,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* can_attach beforehand should guarantee that this doesn't
* fail. TODO: have a better way to handle failure here
*/
- WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+ WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
cpuset_update_task_spread_flag(cs, task);
--
2.31.1
On Tue, Aug 16, 2022 at 03:27:34PM -0400, Waiman Long wrote:
> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
> + const struct cpumask *mask)
> +{
> + cpumask_var_t new_mask;
> + int ret;
> +
> + if (!READ_ONCE(p->user_cpus_ptr)) {
> + ret = set_cpus_allowed_ptr(p, mask);
> + /*
> + * If user_cpus_ptr becomes set now, we are racing with
> + * a concurrent sched_setaffinity(). So use the newly
> + * set user_cpus_ptr and retry again.
> + *
> + * TODO: We cannot detect change in the cpumask pointed to
> + * by user_cpus_ptr. We will have to add a sequence number
> + * if such a race needs to be addressed.
> + */
This is too ugly and obviously broken. Let's please do it properly.
Thanks.
--
tejun
On 8/16/22 16:15, Tejun Heo wrote:
> On Tue, Aug 16, 2022 at 03:27:34PM -0400, Waiman Long wrote:
>> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
>> + const struct cpumask *mask)
>> +{
>> + cpumask_var_t new_mask;
>> + int ret;
>> +
>> + if (!READ_ONCE(p->user_cpus_ptr)) {
>> + ret = set_cpus_allowed_ptr(p, mask);
>> + /*
>> + * If user_cpus_ptr becomes set now, we are racing with
>> + * a concurrent sched_setaffinity(). So use the newly
>> + * set user_cpus_ptr and retry again.
>> + *
>> + * TODO: We cannot detect change in the cpumask pointed to
>> + * by user_cpus_ptr. We will have to add a sequence number
>> + * if such a race needs to be addressed.
>> + */
> This is too ugly and obviously broken. Let's please do it properly.
Actually, there is similar construct in __sched_setaffinity():
again:
retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
if (retval)
goto out_free_new_mask;
cpuset_cpus_allowed(p, cpus_allowed);
if (!cpumask_subset(new_mask, cpus_allowed)) {
/*
* We must have raced with a concurrent cpuset update.
* Just reset the cpumask to the cpuset's cpus_allowed.
*/
cpumask_copy(new_mask, cpus_allowed);
goto again;
}
It is hard to synchronize different subsystems atomically without
running into locking issue. Let me think about what can be done in this
case.
Is using a sequence number to check for race with retry good enough?
Cheers,
Longman
Hello, On Tue, Aug 16, 2022 at 06:11:03PM -0400, Waiman Long wrote: > It is hard to synchronize different subsystems atomically without running > into locking issue. Let me think about what can be done in this case. I have a hard time seeing why this would be particularly difficult. cpuset just needs to make the latest cpumask available to sched core in an easily accessible form and whenever that changes, trigger a set_cpus_allowed call. There's no need to entangle operations across the whole subsystems. All that's needed to be communicated is the current cpumask. > Is using a sequence number to check for race with retry good enough? It seems unnecessarily fragile and complicated to me. If we're gonna change it, let's change it right. Thanks. -- tejun
On 8/16/22 18:19, Tejun Heo wrote: > Hello, > > On Tue, Aug 16, 2022 at 06:11:03PM -0400, Waiman Long wrote: >> It is hard to synchronize different subsystems atomically without running >> into locking issue. Let me think about what can be done in this case. > I have a hard time seeing why this would be particularly difficult. cpuset > just needs to make the latest cpumask available to sched core in an easily > accessible form and whenever that changes, trigger a set_cpus_allowed call. > There's no need to entangle operations across the whole subsystems. All > that's needed to be communicated is the current cpumask. > >> Is using a sequence number to check for race with retry good enough? > It seems unnecessarily fragile and complicated to me. If we're gonna change > it, let's change it right. Thanks for the suggestion. I think I get what you want. I am going to migrate the cpuset_set_cpus_allowed_ptr() logic into set_cpus_allowed_ptr() itself. IOW, if user_cpus_ptr is defined, it will be an additional mask to be applied on top. It does affect all callers of set_cpus_allowed_ptr() though. I am going to drop this cpuset specific patch. BTW, I will be on PTO starting tomorrow until next Tuesday. So I will be slow in responding to emails. Cheers, Longman Cheers, Longman
© 2016 - 2026 Red Hat, Inc.