[v4] sched, cgroup/cpuset: Keep user set cpus affinity

[PATCH v4 3/3] cgroup/cpuset: Keep user set cpus affinity

Posted by Waiman Long 3 years, 7 months ago

It was found that any change to the current cpuset hierarchy may reset
the cpumask of the tasks in the affected cpusets to the default cpuset
value even if those tasks have cpus affinity explicitly set by the users
before. That is especially easy to trigger under a cgroup v2 environment
where writing "+cpuset" to the root cgroup's cgroup.subtree_control
file will reset the cpus affinity of all the processes in the system.

That is problematic in a nohz_full environment where the tasks running
on the nohz_full CPUs usually have their cpus affinity explicitly set
and will behave incorrectly if their cpus affinity changes.

Fix this problem by looking at user_cpus_ptr, which will be set if
cpus affinity has been explicitly set before, and use it to restrict
the given cpumask unless there is no overlap. In that case, it will
fall back to the given one.
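
For illustration only, the restriction rule described above can be
modeled in plain userspace C. This is a sketch, not kernel code:
cpu_bits_t and model_effective_mask() are hypothetical names, a 64-bit
word stands in for struct cpumask, and a zero user mask stands in for a
NULL user_cpus_ptr.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_bits_t;

/*
 * Model of the rule in this patch. user_mask == 0 stands in for
 * "user_cpus_ptr is NULL", i.e. no affinity was ever set explicitly.
 */
static cpu_bits_t model_effective_mask(cpu_bits_t user_mask,
				       cpu_bits_t cpuset_mask)
{
	cpu_bits_t restricted = user_mask & cpuset_mask;

	/* Model assumption: a cpuset with tasks has at least one CPU. */
	assert(cpuset_mask != 0);

	/* Keep the user's choice where it overlaps, else fall back. */
	return restricted ? restricted : cpuset_mask;
}
```

With CPUs 0-3 in the cpuset (0x0F), a task pinned to CPUs 2-3 (0x0C)
keeps 0x0C, while a task pinned to CPUs 4-5 (0x30) has no overlap and
falls back to 0x0F.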

With that change in place, it was verified that tasks that have their
cpus affinity explicitly set will not be affected by changes made to
the v2 cgroup.subtree_control files.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58aadfda9b8b..cabfac540fd8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,30 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	return ret;
 }
 
+/*
+ * Preserve user provided cpumask (if set) as much as possible unless there
+ * is no overlap with the given mask.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *mask)
+{
+	if (p->user_cpus_ptr) {
+		cpumask_var_t new_mask;
+
+		if (alloc_cpumask_var(&new_mask, GFP_KERNEL) &&
+		    copy_user_cpus_mask(p, new_mask) &&
+		    cpumask_and(new_mask, new_mask, mask)) {
+			int ret = set_cpus_allowed_ptr(p, new_mask);
+
+			free_cpumask_var(new_mask);
+			return ret;
+		}
+		free_cpumask_var(new_mask);
+	}
+
+	return set_cpus_allowed_ptr(p, mask);
+}
+
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
@@ -1130,7 +1154,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
-		set_cpus_allowed_ptr(task, cs->effective_cpus);
+		cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
 }
 
@@ -2303,7 +2327,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail.  TODO: have a better way to handle failure here
 		 */
-		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+		WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
 
 		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 		cpuset_update_task_spread_flag(cs, task);
-- 
2.31.1

Re: [PATCH v4 3/3] cgroup/cpuset: Keep user set cpus affinity

Posted by Tejun Heo 3 years, 7 months ago

Hello,

So, overall I think this is the right direction.

> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
> +				       const struct cpumask *mask)
> +{
> +	if (p->user_cpus_ptr) {
> +		cpumask_var_t new_mask;
> +
> +		if (alloc_cpumask_var(&new_mask, GFP_KERNEL) &&
> +		    copy_user_cpus_mask(p, new_mask) &&
> +		    cpumask_and(new_mask, new_mask, mask)) {
> +			int ret = set_cpus_allowed_ptr(p, new_mask);
> +
> +			free_cpumask_var(new_mask);
> +			return ret;
> +		}
> +		free_cpumask_var(new_mask);
> +	}
> +
> +	return set_cpus_allowed_ptr(p, mask);
> +}

But this seems racy to me. Let's say attach and setaffinity race. The
expectation should be that we'd end up with the same eventual mask no matter
what the operation order may be. The above code wouldn't do that, right?
There's nothing synchronizing the two and if setaffinity takes place between
the user_cpus_ptr test and set_cpus_allowed_ptr(), it'd get ignored.
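
To make the ordering problem concrete, here is a single-threaded
userspace model of the interleaving. It is only a sketch under stated
assumptions: struct model_task and all model_*() names are hypothetical,
a 64-bit word stands in for a cpumask, a zero user_mask stands in for a
NULL user_cpus_ptr, and the "race" is replayed by calling the steps in a
fixed order rather than with real threads.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpu_bits_t;	/* hypothetical: one bit per CPU */

struct model_task {
	cpu_bits_t user_mask;	/* 0 models user_cpus_ptr == NULL */
	cpu_bits_t cur_mask;	/* models the task's applied cpumask */
};

/* Models sched_setaffinity(); assumes requested overlaps cpuset_mask. */
static void model_setaffinity(struct model_task *t, cpu_bits_t requested,
			      cpu_bits_t cpuset_mask)
{
	assert((requested & cpuset_mask) != 0);
	t->user_mask = requested;
	t->cur_mask = requested & cpuset_mask;
}

/* Step 1 of the cpuset path: test user_mask and pick the mask to apply. */
static cpu_bits_t model_cpuset_decide(const struct model_task *t,
				      cpu_bits_t cpuset_mask)
{
	cpu_bits_t restricted = t->user_mask & cpuset_mask;

	return restricted ? restricted : cpuset_mask;
}

/* Step 2 of the cpuset path: apply whatever step 1 picked. */
static void model_cpuset_apply(struct model_task *t, cpu_bits_t decided)
{
	t->cur_mask = decided;
}

/* setaffinity completes before the cpuset path starts. */
static cpu_bits_t model_run_serial(cpu_bits_t requested,
				   cpu_bits_t cpuset_mask)
{
	struct model_task t = { 0, cpuset_mask };

	model_setaffinity(&t, requested, cpuset_mask);
	model_cpuset_apply(&t, model_cpuset_decide(&t, cpuset_mask));
	return t.cur_mask;
}

/* setaffinity lands between step 1 and step 2 of the cpuset path. */
static cpu_bits_t model_run_interleaved(cpu_bits_t requested,
					cpu_bits_t cpuset_mask)
{
	struct model_task t = { 0, cpuset_mask };
	cpu_bits_t decided = model_cpuset_decide(&t, cpuset_mask);

	model_setaffinity(&t, requested, cpuset_mask);
	model_cpuset_apply(&t, decided);
	return t.cur_mask;
}
```

With a cpuset of CPUs 0-3 (0x0F) and a request for CPUs 2-3 (0x0C), the
serial order ends at 0x0C while the interleaved order ends at 0x0F, so
the user's request is lost.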

This gotta be more integrated. There is what the user requested and there
are restrictions from CPU hotplug state and cpuset. All three should be
synchronized so that there is one synchronized way to obtain and apply the
current effective mask.

Thanks.

-- 
tejun

Re: [PATCH v4 3/3] cgroup/cpuset: Keep user set cpus affinity

Posted by Waiman Long 3 years, 7 months ago

On 8/16/22 13:20, Tejun Heo wrote:
> Hello,
>
> So, overall I think this is the right direction.
>
>> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
>> +				       const struct cpumask *mask)
>> +{
>> +	if (p->user_cpus_ptr) {
>> +		cpumask_var_t new_mask;
>> +
>> +		if (alloc_cpumask_var(&new_mask, GFP_KERNEL) &&
>> +		    copy_user_cpus_mask(p, new_mask) &&
>> +		    cpumask_and(new_mask, new_mask, mask)) {
>> +			int ret = set_cpus_allowed_ptr(p, new_mask);
>> +
>> +			free_cpumask_var(new_mask);
>> +			return ret;
>> +		}
>> +		free_cpumask_var(new_mask);
>> +	}
>> +
>> +	return set_cpus_allowed_ptr(p, mask);
>> +}
> But this seems racy to me. Let's say attach and setaffinity race. The
> expectation should be that we'd end up with the same eventual mask no matter
> what the operation order may be. The above code wouldn't do that, right?
> There's nothing synchronizing the two and if setaffinity takes place between
> the user_cpus_ptr test and set_cpus_allowed_ptr(), it'd get ignored.

Yes, a race like this is possible. To completely eliminate the race may 
require taking task_rq_lock() and then calling 
__set_cpus_allowed_ptr_locked() which is internal to kernel/sched/core.c.

Alternatively, we can check user_cpus_ptr again after the second 
set_cpus_allowed_ptr() and retry it with the other path if set. That 
will probably address your concern. Please let me know if you are OK 
with that.
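
Roughly, the recheck-and-retry idea would look like the userspace model
below. This is a sketch only: struct model_task, race_hook_t and the
model_*() functions are hypothetical names, a 64-bit word stands in for
a cpumask, and the concurrent setaffinity is simulated by a hook that
fires once between the check and the apply. Being single-threaded, it
shows the control flow of the retry and does not prove that a real
implementation would be race-free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpu_bits_t;	/* hypothetical: one bit per CPU */

struct model_task {
	cpu_bits_t user_mask;	/* 0 models user_cpus_ptr == NULL */
	cpu_bits_t cur_mask;	/* models the task's applied cpumask */
};

/* Simulated concurrent update, run between the check and the apply. */
typedef void (*race_hook_t)(struct model_task *t, cpu_bits_t cpuset_mask);

/* Stand-in for a concurrent sched_setaffinity() to CPUs 2-3. */
static void model_racing_setaffinity(struct model_task *t,
				     cpu_bits_t cpuset_mask)
{
	t->user_mask = 0x0C;
	t->cur_mask = 0x0C & cpuset_mask;
}

/*
 * Apply the cpuset mask, then recheck user_mask and redo the work if it
 * changed while we were applying. Returns the number of passes taken.
 */
static int model_update_with_retry(struct model_task *t,
				   cpu_bits_t cpuset_mask, race_hook_t hook)
{
	cpu_bits_t seen;
	int passes = 0;

	assert(cpuset_mask != 0);	/* model assumption */
	do {
		cpu_bits_t restricted;

		seen = t->user_mask;
		restricted = seen & cpuset_mask;
		if (hook) {
			hook(t, cpuset_mask);	/* the race window */
			hook = NULL;		/* fire only once */
		}
		t->cur_mask = restricted ? restricted : cpuset_mask;
		passes++;
	} while (t->user_mask != seen);

	return passes;
}
```

When the hook fires, the first pass applies the stale mask, the recheck
notices that user_mask changed, and the second pass applies the user's
request restricted to the cpuset.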

Cheers,
Longman

>
> This gotta be more integrated. There is what the user requested and there
> are restrictions from CPU hotplug state and cpuset. All three should be
> synchronized so that there is one synchronized way to obtain and apply the
> current effective mask.
>
> Thanks.
>

Re: [PATCH v4 3/3] cgroup/cpuset: Keep user set cpus affinity

Posted by Tejun Heo 3 years, 7 months ago

Hello,

On Tue, Aug 16, 2022 at 01:38:17PM -0400, Waiman Long wrote:
> Yes, a race like this is possible. To completely eliminate the race may
> require taking task_rq_lock() and then calling
> __set_cpus_allowed_ptr_locked() which is internal to kernel/sched/core.c.
> 
> Alternatively, we can check user_cpus_ptr again after the second
> set_cpus_allowed_ptr() and retry it with the other path if set. That will
> probably address your concern. Please let me know if you are OK with that.

I think this would look better if structured the other way around - make the
scheduler side call out to cpuset to query the current restrictions and
apply them atomically.
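
As a rough userspace model of that shape (a sketch, not a proposed
implementation): every input lives in one place and every change goes
through a single recompute step. struct model_task and the model_*()
names are hypothetical, a 64-bit word stands in for a cpumask, and the
fallback when the user mask has no overlap is assumed to follow the
rule from the patch. In the kernel the recompute would have to run
under one lock; the model is single-threaded and only shows that the
result no longer depends on the order of operations.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpu_bits_t;	/* hypothetical: one bit per CPU */

struct model_task {
	cpu_bits_t user_mask;	/* what the user requested, 0 if nothing */
	cpu_bits_t cpuset_mask;	/* restriction from cpuset */
	cpu_bits_t online_mask;	/* restriction from CPU hotplug state */
	cpu_bits_t cur_mask;	/* the effective mask actually applied */
};

/* The single place where the effective mask is derived and applied. */
static void model_recompute(struct model_task *t)
{
	cpu_bits_t allowed = t->cpuset_mask & t->online_mask;
	cpu_bits_t restricted = t->user_mask & allowed;

	/* Model assumption: cpuset always keeps one online CPU. */
	assert(allowed != 0);
	t->cur_mask = restricted ? restricted : allowed;
}

/* Each input change stores the new value, then recomputes. */
static void model_set_user(struct model_task *t, cpu_bits_t mask)
{
	t->user_mask = mask;
	model_recompute(t);
}

static void model_set_cpuset(struct model_task *t, cpu_bits_t mask)
{
	t->cpuset_mask = mask;
	model_recompute(t);
}

static void model_set_online(struct model_task *t, cpu_bits_t mask)
{
	t->online_mask = mask;
	model_recompute(t);
}
```

Because the user's request is stored rather than consumed, it comes
back into effect once a temporary restriction is lifted, for example
when CPUs 2-3 go offline and later come back online.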

Thanks.

-- 
tejun

Re: [PATCH v4 3/3] cgroup/cpuset: Keep user set cpus affinity

Posted by Waiman Long 3 years, 7 months ago

On 8/16/22 13:52, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 16, 2022 at 01:38:17PM -0400, Waiman Long wrote:
>> Yes, a race like this is possible. To completely eliminate the race may
>> require taking task_rq_lock() and then calling
>> __set_cpus_allowed_ptr_locked() which is internal to kernel/sched/core.c.
>>
>> Alternatively, we can check user_cpus_ptr again after the second
>> set_cpus_allowed_ptr() and retry it with the other path if set. That will
>> probably address your concern. Please let me know if you are OK with that.
> I think this would look better if structured the other way around - make the
> scheduler side call out to cpuset to query the current restrictions and
> apply them atomically.

The sched_setaffinity() function does call cpuset_cpus_allowed() to 
apply the cpuset constraint. However, making set_cpus_allowed_ptr() call 
a cpuset function is a major change. It will disturb the current locking 
sequences and may cause circular locking dependency problems. We 
certainly need more time to figure out if that is feasible.

In the meantime, I would prefer to do a retry if the user_cpus_ptr 
status changes. We can then do a follow-up patch to make this 
structural change if there is consensus on doing so.

Cheers,
Longman