[PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left

Waiman Long posted 1 patch 2 months, 2 weeks ago
kernel/sched/core.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
[PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Waiman Long 2 months, 2 weeks ago
Chen Ridong reported that cpuset could report a kernel warning for a task
due to set_cpus_allowed_ptr() returning failure in the corner case that:

1) the task used sched_setaffinity(2) to set its CPU affinity mask to
   be the same as the cpuset.cpus of its cpuset,
2) all the CPUs assigned to that cpuset were taken offline, and
3) cpuset v1 is in use and the task had to be migrated to the top cpuset.

Because the CPU affinity of tasks in the top cpuset is not updated
when a CPU hotplug online/offline event happens, offline CPUs are
included in the CPU affinity of those tasks. It is possible
that further masking with user_cpus_ptr set by sched_setaffinity(2)
in __set_cpus_allowed_ptr() will leave only offline CPUs in the new
mask causing the subsequent call to __set_cpus_allowed_ptr_locked()
to return failure with an empty CPU affinity.

Fix this failure by skipping user_cpus_ptr masking if there is no online
CPU left.

Reported-by: Chen Ridong <chenridong@huaweicloud.com>
Closes: https://lore.kernel.org/lkml/20250714032311.3570157-1-chenridong@huaweicloud.com/
Fixes: da019032819a ("sched: Enforce user requested affinity")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df1..208f8af73134 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3170,12 +3170,13 @@ int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx)
 
 	rq = task_rq_lock(p, &rf);
 	/*
-	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
-	 * flags are set.
+	 * Masking should be skipped if SCA_USER, any of the SCA_MIGRATE_*
+	 * flags are set or no online CPU left.
 	 */
 	if (p->user_cpus_ptr &&
 	    !(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
-	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
+	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr) &&
+	    cpumask_intersects(rq->scratch_mask, cpu_active_mask))
 		ctx->new_mask = rq->scratch_mask;
 
 	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
-- 
2.50.0
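
The condition changed by the patch can be modeled in userspace as a rough sketch. This is not kernel code: the type `cpumask_model_t`, the function `pick_new_mask()`, and its boolean parameters are hypothetical stand-ins, with each bit of a 64-bit integer representing one CPU. It assumes that cpumask_and() reports whether the result is non-empty and that cpumask_intersects() reports whether two masks share a bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: bit N of the mask stands for CPU N. */
typedef uint64_t cpumask_model_t;

/*
 * Model of the patched condition in __set_cpus_allowed_ptr().
 *
 * new_mask       - mask requested by the caller (ctx->new_mask)
 * user_mask      - mask saved by an earlier sched_setaffinity(2) call
 * has_user_mask  - models p->user_cpus_ptr being non-NULL
 * skip_flags_set - models SCA_USER or an SCA_MIGRATE_* flag being set
 * active_mask    - models cpu_active_mask
 *
 * Returns the mask that would be handed to the locked helper.
 */
static cpumask_model_t pick_new_mask(cpumask_model_t new_mask,
				     cpumask_model_t user_mask,
				     bool has_user_mask,
				     bool skip_flags_set,
				     cpumask_model_t active_mask)
{
	cpumask_model_t scratch = new_mask & user_mask;

	if (has_user_mask && !skip_flags_set &&
	    scratch != 0 &&			/* cpumask_and() result non-empty */
	    (scratch & active_mask) != 0)	/* check added by this patch */
		return scratch;

	/* Masking skipped: fall back to the caller's mask unchanged. */
	return new_mask;
}
```

With new_mask = CPUs 0-3, user_mask = CPUs 2-3 and only CPUs 0-1 active, the model returns CPUs 0-3 instead of the all-offline intersection, which is the fallback the commit message describes.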
Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Chen Ridong 2 months, 2 weeks ago

On 2025/7/19 0:48, Waiman Long wrote:
> Chen Ridong reported that cpuset could report a kernel warning for a task
> due to set_cpus_allowed_ptr() returning failure in the corner case that:
> 
> 1) the task used sched_setaffinity(2) to set its CPU affinity mask to
>    be the same as the cpuset.cpus of its cpuset,
> 2) all the CPUs assigned to that cpuset were taken offline, and
> 3) cpuset v1 is in use and the task had to be migrated to the top cpuset.
> 
> Due to the fact that CPU affinity of the tasks in the top cpuset are
> not updated when a CPU hotplug online/offline event happens, offline
> CPUs are included in CPU affinity of those tasks. It is possible
> that further masking with user_cpus_ptr set by sched_setaffinity(2)
> in __set_cpus_allowed_ptr() will leave only offline CPUs in the new
> mask causing the subsequent call to __set_cpus_allowed_ptr_locked()
> to return failure with an empty CPU affinity.
> 
> Fix this failure by skipping user_cpus_ptr masking if there is no online
> CPU left.
> 
> Reported-by: Chen Ridong <chenridong@huaweicloud.com>
> Closes: https://lore.kernel.org/lkml/20250714032311.3570157-1-chenridong@huaweicloud.com/
> Fixes: da019032819a ("sched: Enforce user requested affinity")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sched/core.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 81c6df746df1..208f8af73134 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3170,12 +3170,13 @@ int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx)
>  
>  	rq = task_rq_lock(p, &rf);
>  	/*
> -	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
> -	 * flags are set.
> +	 * Masking should be skipped if SCA_USER, any of the SCA_MIGRATE_*
> +	 * flags are set or no online CPU left.
>  	 */
>  	if (p->user_cpus_ptr &&
>  	    !(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
> -	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
> +	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr) &&
> +	    cpumask_intersects(rq->scratch_mask, cpu_active_mask))
>  		ctx->new_mask = rq->scratch_mask;
>  
>  	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);

Tested-by: Chen Ridong <chenridong@huawei.com>
Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Chen Ridong 2 months, 1 week ago

On 2025/7/23 9:58, Chen Ridong wrote:
> 
> 
> On 2025/7/19 0:48, Waiman Long wrote:
>> Chen Ridong reported that cpuset could report a kernel warning for a task
>> due to set_cpus_allowed_ptr() returning failure in the corner case that:
>>
>> 1) the task used sched_setaffinity(2) to set its CPU affinity mask to
>>    be the same as the cpuset.cpus of its cpuset,
>> 2) all the CPUs assigned to that cpuset were taken offline, and
>> 3) cpuset v1 is in use and the task had to be migrated to the top cpuset.
>>
>> Due to the fact that CPU affinity of the tasks in the top cpuset are
>> not updated when a CPU hotplug online/offline event happens, offline
>> CPUs are included in CPU affinity of those tasks. It is possible
>> that further masking with user_cpus_ptr set by sched_setaffinity(2)
>> in __set_cpus_allowed_ptr() will leave only offline CPUs in the new
>> mask causing the subsequent call to __set_cpus_allowed_ptr_locked()
>> to return failure with an empty CPU affinity.
>>
>> Fix this failure by skipping user_cpus_ptr masking if there is no online
>> CPU left.
>>
>> Reported-by: Chen Ridong <chenridong@huaweicloud.com>
>> Closes: https://lore.kernel.org/lkml/20250714032311.3570157-1-chenridong@huaweicloud.com/
>> Fixes: da019032819a ("sched: Enforce user requested affinity")
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/sched/core.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 81c6df746df1..208f8af73134 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3170,12 +3170,13 @@ int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx)
>>  
>>  	rq = task_rq_lock(p, &rf);
>>  	/*
>> -	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
>> -	 * flags are set.
>> +	 * Masking should be skipped if SCA_USER, any of the SCA_MIGRATE_*
>> +	 * flags are set or no online CPU left.
>>  	 */
>>  	if (p->user_cpus_ptr &&
>>  	    !(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
>> -	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
>> +	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr) &&
>> +	    cpumask_intersects(rq->scratch_mask, cpu_active_mask))
>>  		ctx->new_mask = rq->scratch_mask;
>>  
>>  	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
> 
> Tested-by:  Chen Ridong <chenridong@huawei.com>
> 

Friendly ping.

Best regards,
Ridong
Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Chen Ridong 1 month, 3 weeks ago

On 2025/7/31 20:03, Chen Ridong wrote:
> 
> 
> On 2025/7/23 9:58, Chen Ridong wrote:
>>
>>
>> On 2025/7/19 0:48, Waiman Long wrote:
>>> Chen Ridong reported that cpuset could report a kernel warning for a task
>>> due to set_cpus_allowed_ptr() returning failure in the corner case that:
>>>
>>> 1) the task used sched_setaffinity(2) to set its CPU affinity mask to
>>>    be the same as the cpuset.cpus of its cpuset,
>>> 2) all the CPUs assigned to that cpuset were taken offline, and
>>> 3) cpuset v1 is in use and the task had to be migrated to the top cpuset.
>>>
>>> Due to the fact that CPU affinity of the tasks in the top cpuset are
>>> not updated when a CPU hotplug online/offline event happens, offline
>>> CPUs are included in CPU affinity of those tasks. It is possible
>>> that further masking with user_cpus_ptr set by sched_setaffinity(2)
>>> in __set_cpus_allowed_ptr() will leave only offline CPUs in the new
>>> mask causing the subsequent call to __set_cpus_allowed_ptr_locked()
>>> to return failure with an empty CPU affinity.
>>>
>>> Fix this failure by skipping user_cpus_ptr masking if there is no online
>>> CPU left.
>>>
>>> Reported-by: Chen Ridong <chenridong@huaweicloud.com>
>>> Closes: https://lore.kernel.org/lkml/20250714032311.3570157-1-chenridong@huaweicloud.com/
>>> Fixes: da019032819a ("sched: Enforce user requested affinity")
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>  kernel/sched/core.c | 7 ++++---
>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 81c6df746df1..208f8af73134 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3170,12 +3170,13 @@ int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx)
>>>  
>>>  	rq = task_rq_lock(p, &rf);
>>>  	/*
>>> -	 * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
>>> -	 * flags are set.
>>> +	 * Masking should be skipped if SCA_USER, any of the SCA_MIGRATE_*
>>> +	 * flags are set or no online CPU left.
>>>  	 */
>>>  	if (p->user_cpus_ptr &&
>>>  	    !(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
>>> -	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
>>> +	    cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr) &&
>>> +	    cpumask_intersects(rq->scratch_mask, cpu_active_mask))
>>>  		ctx->new_mask = rq->scratch_mask;
>>>  
>>>  	return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
>>
>> Tested-by:  Chen Ridong <chenridong@huawei.com>
>>
> 
> Friendly ping.
> 
> Best regards,
> Ridong
> 

Could someone please review this patch?

-- 
Best regards,
Ridong
Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Michal Koutný 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 12:48:56PM -0400, Waiman Long <longman@redhat.com> wrote:
> Chen Ridong reported that cpuset could report a kernel warning for a task
> due to set_cpus_allowed_ptr() returning failure in the corner case that:
> 
> 1) the task used sched_setaffinity(2) to set its CPU affinity mask to
>    be the same as the cpuset.cpus of its cpuset,
> 2) all the CPUs assigned to that cpuset were taken offline, and
> 3) cpuset v1 is in use and the task had to be migrated to the top cpuset.

Does this make sense for cpuset v2 (or no cpuset at all for that matter)?
I'm asking whether this mask modification could be extracted into
cpuset-v1.c only (like cgroup_transfer_tasks() or a new function).

Thanks,
Michal
Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Waiman Long 2 months, 2 weeks ago
On 7/21/25 11:13 AM, Michal Koutný wrote:
> On Fri, Jul 18, 2025 at 12:48:56PM -0400, Waiman Long <longman@redhat.com> wrote:
>> Chen Ridong reported that cpuset could report a kernel warning for a task
>> due to set_cpus_allowed_ptr() returning failure in the corner case that:
>>
>> 1) the task used sched_setaffinity(2) to set its CPU affinity mask to
>>     be the same as the cpuset.cpus of its cpuset,
>> 2) all the CPUs assigned to that cpuset were taken offline, and
>> 3) cpuset v1 is in use and the task had to be migrated to the top cpuset.
> Does this make sense for cpuset v2 (or no cpuset at all for that matter)?
> I'm asking whether this mask modification could only be extracted into
> cpuset-v1.c (like cgroup_transfer_tasks() or a new function)

This corner case as specified in Chen Ridong's patch only happens in a
cpuset v1 environment, but it is still the case that the default CPU
affinity of the root cgroup (with or without CONFIG_CGROUPS) will
include offline CPUs, if present. So it still makes sense to skip the
sched_setaffinity() setting if there is no online CPU left, though it
will be much harder to hit such a condition without using cpuset v1.

Cheers,
Longman

Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Michal Koutný 1 month, 1 week ago
Hi.

I had a look after a while (thanks for the reminders, Ridong).

On Mon, Jul 21, 2025 at 11:28:15AM -0400, Waiman Long <llong@redhat.com> wrote:
> This corner case as specified in Chen Ridong's patch only happens with a
> cpuset v1 environment, but it is still the case that the default cpu
> affinity of the root cgroup (with or without CONFIG_CGROUPS) will include
> offline CPUs, if present.

IIUC, the generic sched_setaffinity(2) is ready for that, simply
returning an EINVAL.

> So it still makes sense to skip the sched_setaffinity() setting if
> there is no online CPU left, though it will be much harder to have
> such a condition without using cpuset v1.

That sounds like there'd be no issue without cpuset v1 and the source of
the warning has quite a telling comment: 

	 * fail.  TODO: have a better way to handle failure here
	 */
	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));

The trouble is that this is from cpuset_attach() (cgroup_subsys.attach)
where no errors are expected. So I'd say the place for the check should
be earlier in cpuset_can_attach() [1]. I'm not sure if that's universally
immune against cpu offlining but it'd be sufficient for the reported
sequential offlining.

HTH,
Michal

[1] Although the error propagates, it ends up without recovery in
remove_tasks_in_empty_cpuset() "only" as an error message. But that's
likely all that can be done in this workfn context -- it's better than
silently skipping the migration as a consequence of this patch.
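
The alternative suggested above, rejecting the move before attach instead of warning afterwards, could look roughly like the following userspace model. It is only a sketch under stated assumptions: `cpus_model_t` and `can_attach_model()` are hypothetical names, not the kernel's cpuset_can_attach(), and the model assumes the unpatched masking behaviour (the user mask is applied whenever it overlaps the target mask).

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: bit N of the mask stands for CPU N. */
typedef uint64_t cpus_model_t;

/*
 * Hypothetical pre-attach check, NOT the kernel's cpuset_can_attach().
 *
 * cpus_attach - CPUs of the destination cpuset
 * user_mask   - mask saved by sched_setaffinity(2), 0 if never set
 * active_mask - currently active CPUs
 *
 * Returns 0 if the task would keep at least one active CPU after the
 * move, or -EINVAL if the resulting affinity would hold only offline CPUs.
 */
static int can_attach_model(cpus_model_t cpus_attach,
			    cpus_model_t user_mask,
			    cpus_model_t active_mask)
{
	cpus_model_t effective = cpus_attach;

	/* Apply the user-requested mask only when it overlaps the target. */
	if (user_mask && (cpus_attach & user_mask))
		effective = cpus_attach & user_mask;

	return (effective & active_mask) ? 0 : -EINVAL;
}
```

In this model the reported sequence (destination CPUs 0-3, user mask CPUs 2-3, CPUs 2-3 offline) yields -EINVAL up front, so the failure would be reported as an error rather than hit in the attach path.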
Re: [PATCH v3] sched/core: Skip user_cpus_ptr masking if no online CPU left
Posted by Waiman Long 1 month, 1 week ago
On 8/26/25 10:25 AM, Michal Koutný wrote:
> Hi.
>
> I had a look after a while (thanks for reminders Ridong).
>
> On Mon, Jul 21, 2025 at 11:28:15AM -0400, Waiman Long <llong@redhat.com> wrote:
>> This corner case as specified in Chen Ridong's patch only happens with a
>> cpuset v1 environment, but it is still the case that the default cpu
>> affinity of the root cgroup (with or without CONFIG_CGROUPS) will include
>> offline CPUs, if present.
> IIUC, the generic sched_setaffinity(2) is ready for that, simply
> returning an EINVAL.

The modified code will not be executed when called from
sched_setaffinity(), as the SCA_USER flag will be set.

In the described scenario, sched_setaffinity() was called without
failure, as the request was valid at the time.

>
>> So it still makes sense to skip the sched_setaffinity() setting if
>> there is no online CPU left, though it will be much harder to have
>> such a condition without using cpuset v1.
> That sounds like there'd be no issue without cpuset v1 and the source of
> the warning has quite a telling comment:
>
> 	 * fail.  TODO: have a better way to handle failure here
> 	 */
> 	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
>
> The trouble is that this is from cpuset_attach() (cgroup_subsys.attach)
> where no errors are expected. So I'd say the place for the check should
> be earlier in cpuset_can_attach() [1]. I'm not sure if that's universally
> immune against cpu offlining but it'd be sufficient for the reported
> sequential offlining.

Cpuset v1 has no concept of an effective cpumask that excludes offline
CPUs unless the "cpuset_v2_mode" mount option is used. So when a cpuset
has no CPU left, it will force-migrate its tasks to its parent and the
__set_cpus_allowed_ptr() function will be invoked. The parent will
likely have those offline CPUs in its cpus_allowed list, and
__set_cpus_allowed_ptr_locked() will be called with only the offline
CPUs, causing the warning. Migrating to the top_cpuset is probably not
needed to illustrate the problem.

Cheers,
Longman

> HTH,
> Michal
>
> [1] Although the error propagates, it ends up without recovery in
> remove_tasks_in_empty_cpuset() "only" as an error message. But that's
> likely all that can be done in this workfn context -- it's better than
> silently skipping the migration as a consequence of this patch.