It was found that any change to the current cpuset hierarchy may reset
the cpumask of the tasks in the affected cpusets to the default cpuset
value, even if those tasks had their CPU affinity explicitly set by the
user beforehand. That is especially easy to trigger in a cgroup v2
environment, where writing "+cpuset" to the root cgroup's
cgroup.subtree_control file resets the CPU affinity of all the processes
in the system.

That is problematic in a nohz_full environment, where the tasks running
on the nohz_full CPUs usually have their CPU affinity explicitly set
and will behave incorrectly if that affinity changes.

Fix this problem by looking at user_cpus_ptr, which is set if the CPU
affinity has been explicitly set before, and using it to restrict the
given cpumask unless there is no overlap. In that case, fall back to
the given cpumask.

With that change in place, it was verified that tasks that have their
CPU affinity explicitly set are not affected by changes made to
the v2 cgroup.subtree_control files.

Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58aadfda9b8b..cabfac540fd8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,30 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	return ret;
 }

+/*
+ * Preserve user provided cpumask (if set) as much as possible unless there
+ * is no overlap with the given mask.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *mask)
+{
+	if (p->user_cpus_ptr) {
+		cpumask_var_t new_mask;
+
+		if (alloc_cpumask_var(&new_mask, GFP_KERNEL) &&
+		    copy_user_cpus_mask(p, new_mask) &&
+		    cpumask_and(new_mask, new_mask, mask)) {
+			int ret = set_cpus_allowed_ptr(p, new_mask);
+
+			free_cpumask_var(new_mask);
+			return ret;
+		}
+		free_cpumask_var(new_mask);
+	}
+
+	return set_cpus_allowed_ptr(p, mask);
+}
+
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
@@ -1130,7 +1154,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
-		set_cpus_allowed_ptr(task, cs->effective_cpus);
+		cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
 }
@@ -2303,7 +2327,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail. TODO: have a better way to handle failure here
 		 */
-		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+		WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
 		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 		cpuset_update_task_spread_flag(cs, task);
--
2.31.1
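The fallback rule described in the commit message can be sketched in user space as follows. This is only an illustrative model, not the kernel code: the cpu_bits type, the pick_effective_mask() name, and the convention that a zero user mask stands for a NULL user_cpus_ptr are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_bits;

/*
 * Return the mask a task would end up with under the rule in the commit
 * message: the user-requested mask restricted to the cpuset mask, or the
 * cpuset mask itself when there is no user request (modelled here as
 * user_mask == 0) or when the two masks do not overlap.
 */
cpu_bits pick_effective_mask(cpu_bits user_mask, cpu_bits cpuset_mask)
{
	cpu_bits both = user_mask & cpuset_mask;

	return both ? both : cpuset_mask;
}

/* A few worked cases with a cpuset covering CPUs 0-3 (0x0f). */
void pick_effective_mask_examples(void)
{
	/* User pinned to CPUs 2-3: the request survives the cpuset update. */
	assert(pick_effective_mask(0x0c, 0x0f) == 0x0c);
	/* User pinned to CPUs 4-5: no overlap, so the cpuset mask is used. */
	assert(pick_effective_mask(0x30, 0x0f) == 0x0f);
	/* No user request: the cpuset mask is used unchanged. */
	assert(pick_effective_mask(0x00, 0x0f) == 0x0f);
}
```

The patch expresses the same decision with cpumask_and(), whose return value indicates whether the intersection is non-empty.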
Hello,
So, overall I think this is the right direction.
> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
> +				       const struct cpumask *mask)
> +{
> +	if (p->user_cpus_ptr) {
> +		cpumask_var_t new_mask;
> +
> +		if (alloc_cpumask_var(&new_mask, GFP_KERNEL) &&
> +		    copy_user_cpus_mask(p, new_mask) &&
> +		    cpumask_and(new_mask, new_mask, mask)) {
> +			int ret = set_cpus_allowed_ptr(p, new_mask);
> +
> +			free_cpumask_var(new_mask);
> +			return ret;
> +		}
> +		free_cpumask_var(new_mask);
> +	}
> +
> +	return set_cpus_allowed_ptr(p, mask);
> +}
But this seems racy to me. Let's say attach and setaffinity race. The
expectation should be that we'd end up with the same eventual mask no matter
what the operation order may be. The above code wouldn't do that, right?
There's nothing synchronizing the two and if setaffinity takes place between
the user_cpus_ptr test and set_cpus_allowed_ptr(), it'd get ignored.
This gotta be more integrated. There is what the user requested and there
are restrictions from CPU hotplug state and cpuset. All three should be
synchronized so that there is one synchronized way to obtain and apply the
current effective mask.
Thanks.
--
tejun
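The ordering problem raised above can be made concrete with a small single-threaded model that replays two interleavings by hand. Everything here is hypothetical (struct model_task, the model_* functions, a zero user_mask standing for a NULL user_cpus_ptr, and a cpuset mask that stays constant); it only illustrates why an unsynchronized check followed by an apply can lose a concurrent setaffinity request.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpu_bits;	/* hypothetical: one bit per CPU */

/* Hypothetical model of the per-task state involved in the race. */
struct model_task {
	cpu_bits user_mask;	/* 0 models a NULL user_cpus_ptr */
	cpu_bits cpus_mask;	/* mask the task actually runs with */
};

/* Models setaffinity; assumes the request overlaps the cpuset. */
static void model_setaffinity(struct model_task *t, cpu_bits req,
			      cpu_bits cpuset)
{
	t->user_mask = req;
	t->cpus_mask = req & cpuset;
}

/* Attach, step 1: sample the user request (the user_cpus_ptr test). */
static cpu_bits model_attach_sample(const struct model_task *t)
{
	return t->user_mask;
}

/* Attach, step 2: apply a mask based on the possibly stale sample. */
static void model_attach_apply(struct model_task *t, cpu_bits sampled,
			       cpu_bits cpuset)
{
	cpu_bits both = sampled & cpuset;

	t->cpus_mask = both ? both : cpuset;
}

/* setaffinity completes first, then the whole attach runs. */
cpu_bits model_order_setaffinity_first(cpu_bits req, cpu_bits cpuset)
{
	struct model_task t = { 0, cpuset };

	model_setaffinity(&t, req, cpuset);
	model_attach_apply(&t, model_attach_sample(&t), cpuset);
	return t.cpus_mask;
}

/* setaffinity lands between the attach's sample and apply steps. */
cpu_bits model_order_interleaved(cpu_bits req, cpu_bits cpuset)
{
	struct model_task t = { 0, cpuset };
	cpu_bits sampled = model_attach_sample(&t);	/* sees no request */

	model_setaffinity(&t, req, cpuset);
	model_attach_apply(&t, sampled, cpuset);	/* overwrites it */
	return t.cpus_mask;
}
```

With a request for CPUs 2-3 (0x0c) inside a cpuset of CPUs 0-3 (0x0f), the first ordering leaves the task on 0x0c while the interleaved one leaves it on 0x0f, so the final mask depends on timing.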
On 8/16/22 13:20, Tejun Heo wrote:
> Hello,
>
> So, overall I think this is the right direction.
>
>> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
>> +				       const struct cpumask *mask)
>> +{
>> +	if (p->user_cpus_ptr) {
>> +		cpumask_var_t new_mask;
>> +
>> +		if (alloc_cpumask_var(&new_mask, GFP_KERNEL) &&
>> +		    copy_user_cpus_mask(p, new_mask) &&
>> +		    cpumask_and(new_mask, new_mask, mask)) {
>> +			int ret = set_cpus_allowed_ptr(p, new_mask);
>> +
>> +			free_cpumask_var(new_mask);
>> +			return ret;
>> +		}
>> +		free_cpumask_var(new_mask);
>> +	}
>> +
>> +	return set_cpus_allowed_ptr(p, mask);
>> +}
> But this seems racy to me. Let's say attach and setaffinity race. The
> expectation should be that we'd end up with the same eventual mask no matter
> what the operation order may be. The above code wouldn't do that, right?
> There's nothing synchronizing the two and if setaffinity takes place between
> the user_cpus_ptr test and set_cpus_allowed_ptr(), it'd get ignored.
Yes, a race like this is possible. Completely eliminating the race may
require taking task_rq_lock() and then calling
__set_cpus_allowed_ptr_locked(), which is internal to kernel/sched/core.c.
Alternatively, we can check user_cpus_ptr again after the second
set_cpus_allowed_ptr() and retry it with the other path if set. That
will probably address your concern. Please let me know if you are OK
with that.
Cheers,
Longman
>
> This gotta be more integrated. There is what the user requested and there
> are restrictions from CPU hotplug state and cpuset. All three should be
> synchronized so that there is one synchronized way to obtain and apply the
> current effective mask.
>
> Thanks.
>
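The "check again and retry" alternative proposed in this reply might look roughly like the following user-space model. It is a sketch under stated assumptions, not the proposed kernel change: struct retry_task, retry_attach(), and the racer hook that stands in for a concurrent setaffinity call are all invented for illustration, and a zero user_mask again models a NULL user_cpus_ptr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpu_bits;	/* hypothetical: one bit per CPU */

struct retry_task {
	cpu_bits user_mask;	/* 0 models a NULL user_cpus_ptr */
	cpu_bits cpus_mask;	/* mask the task actually runs with */
};

/* Hypothetical hook standing in for a concurrent setaffinity call. */
typedef void (*racer_fn)(struct retry_task *t, cpu_bits cpuset);

static cpu_bits retry_pick(cpu_bits user, cpu_bits cpuset)
{
	cpu_bits both = user & cpuset;

	return both ? both : cpuset;
}

/*
 * Apply the cpuset mask, then re-check the user request and go around
 * again if it changed since it was sampled. If racer is non-NULL it runs
 * once, in the window between sampling and applying.
 * Returns the number of passes taken.
 */
int retry_attach(struct retry_task *t, cpu_bits cpuset, racer_fn racer)
{
	int passes = 0;
	cpu_bits sampled;

	do {
		sampled = t->user_mask;
		if (racer) {
			racer(t, cpuset);
			racer = NULL;	/* the simulated race fires once */
		}
		t->cpus_mask = retry_pick(sampled, cpuset);
		passes++;
	} while (t->user_mask != sampled);

	return passes;
}

/* Example racer: the user pins the task to CPUs 2-3 (0x0c). */
void retry_racer_pin_2_3(struct retry_task *t, cpu_bits cpuset)
{
	t->user_mask = 0x0c;
	t->cpus_mask = t->user_mask & cpuset;
}
```

In this model the second pass picks up the request that arrived during the first one. A real concurrent caller could still change the request after the final check, which is why the reply notes that fully closing the race may need the rq lock.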
Hello,
On Tue, Aug 16, 2022 at 01:38:17PM -0400, Waiman Long wrote:
> Yes, a race like this is possible. To completely eliminate the race may
> require taking task_rq_lock() and then calling
> __set_cpus_allowed_ptr_locked() which is internal to kernel/sched/core.c.
>
> Alternatively, we can check user_cpus_ptr again after the second
> set_cpus_allowed_ptr() and retry it with the other path if set. That will
> probably address your concern. Please let me know if you are OK with that.
I think this would look better if structured the other way around - make the
scheduler side call out to cpuset to query the current restrictions and
apply it atomically.
Thanks.
--
tejun
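One way to picture the structure suggested here is a single locked path that recomputes the effective mask from all current inputs whenever any one of them changes. The sketch below is a user-space model built on a pthread mutex. The names (struct unified_task, unified_init(), unified_set_user(), unified_set_cpuset()) are hypothetical, and it does not reflect how the scheduler and cpuset code are actually layered or locked.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef uint64_t cpu_bits;	/* hypothetical: one bit per CPU */

/* Hypothetical model: all three inputs live behind one lock. */
struct unified_task {
	pthread_mutex_t lock;
	cpu_bits user_mask;	/* 0 models "no user request" */
	cpu_bits cpuset_mask;	/* restriction from cpuset */
	cpu_bits online_mask;	/* restriction from CPU hotplug state */
	cpu_bits cpus_mask;	/* resulting effective mask */
};

/* Recompute the effective mask. The caller must hold t->lock. */
static void unified_recompute(struct unified_task *t)
{
	cpu_bits allowed = t->cpuset_mask & t->online_mask;
	cpu_bits both = t->user_mask & allowed;

	t->cpus_mask = both ? both : allowed;
}

void unified_init(struct unified_task *t, cpu_bits cpuset, cpu_bits online)
{
	pthread_mutex_init(&t->lock, NULL);
	t->user_mask = 0;
	t->cpuset_mask = cpuset;
	t->online_mask = online;
	unified_recompute(t);	/* no other thread can see t yet */
}

/* Models setaffinity: record the request and reapply under the lock. */
void unified_set_user(struct unified_task *t, cpu_bits req)
{
	pthread_mutex_lock(&t->lock);
	t->user_mask = req;
	unified_recompute(t);
	pthread_mutex_unlock(&t->lock);
}

/* Models a cpuset change: record the new mask and reapply under the lock. */
void unified_set_cpuset(struct unified_task *t, cpu_bits cpuset)
{
	pthread_mutex_lock(&t->lock);
	t->cpuset_mask = cpuset;
	unified_recompute(t);
	pthread_mutex_unlock(&t->lock);
}
```

Because each update stores its input and recomputes from the full current state while holding the same lock, the final mask in this model does not depend on the order in which the two operations run.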
On 8/16/22 13:52, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 16, 2022 at 01:38:17PM -0400, Waiman Long wrote:
>> Yes, a race like this is possible. To completely eliminate the race may
>> require taking task_rq_lock() and then calling
>> __set_cpus_allowed_ptr_locked() which is internal to kernel/sched/core.c.
>>
>> Alternatively, we can check user_cpus_ptr again after the second
>> set_cpus_allowed_ptr() and retry it with the other path if set. That will
>> probably address your concern. Please let me know if you are OK with that.
> I think this would look better if structured the other way around - make the
> scheduler side call out to cpuset to query the current restrictions and
> apply it atomically.
The sched_setaffinity() function does call cpuset_cpus_allowed() to apply
the cpuset constraint. However, making set_cpus_allowed_ptr() call a cpuset
function is a major change. It will disturb the current locking sequences
and may cause a circular locking dependency problem. We certainly need more
time to figure out if that is feasible. For now, I would prefer to do a
retry if the user_cpus_ptr status changes. We can then do a follow-up patch
to make this structural change if there is a consensus on doing so.
Cheers,
Longman