Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask()
on top_cpuset") enabled us to pull CPUs dedicated to child partitions
from tasks in top_cpuset by ignoring per cpu kthreads. However, there
can be other kthreads that are not per cpu but have the PF_NO_SETAFFINITY
flag set to indicate that we shouldn't mess with their CPU affinity.
The affinity of such kthreads is still changed to skip CPUs dedicated
to child partitions, whether the partition is an isolated or a scheduling
one.

Since all per cpu kthreads have PF_NO_SETAFFINITY set, the
PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
Fix this issue by dropping the kthread_is_per_cpu() check and checking
the PF_NO_SETAFFINITY flag instead.
Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d0143b3dce47..967603300ee3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1130,9 +1130,11 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
if (top_cs) {
/*
- * Percpu kthreads in top_cpuset are ignored
+ * PF_NO_SETAFFINITY tasks are ignored.
+ * All per cpu kthreads should have PF_NO_SETAFFINITY
+ * flag set, see kthread_set_per_cpu().
*/
- if (kthread_is_per_cpu(task))
+ if (task->flags & PF_NO_SETAFFINITY)
continue;
cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
} else {
--
2.49.0
On Thu, May 08, 2025 at 03:24:13PM -0400, Waiman Long wrote:
> Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask()
> on top_cpuset") enabled us to pull CPUs dedicated to child partitions
> from tasks in top_cpuset by ignoring per cpu kthreads. However, there
> can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
> flag set to indicate that we shouldn't mess with their CPU affinity.
> For other kthreads, their affinity will be changed to skip CPUs dedicated
> to child partitions whether it is an isolating or a scheduling one.
>
> As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
> PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
> Fix this issue by dropping the kthread_is_per_cpu() check and checking
> the PF_NO_SETAFFINITY flag instead.
>
> Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
> Signed-off-by: Waiman Long <longman@redhat.com>
Applied to cgroup/for-6.15-fixes.
Thanks.
--
tejun
On Thu, May 08, 2025 at 03:24:13PM -0400, Waiman Long wrote:
> Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask()
> on top_cpuset") enabled us to pull CPUs dedicated to child partitions
> from tasks in top_cpuset by ignoring per cpu kthreads. However, there
> can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
> flag set to indicate that we shouldn't mess with their CPU affinity.
> For other kthreads, their affinity will be changed to skip CPUs dedicated
> to child partitions whether it is an isolating or a scheduling one.
>
> As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
> PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
> Fix this issue by dropping the kthread_is_per_cpu() check and checking
> the PF_NO_SETAFFINITY flag instead.
>
> Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index d0143b3dce47..967603300ee3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1130,9 +1130,11 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
>
> if (top_cs) {
> /*
> - * Percpu kthreads in top_cpuset are ignored
> + * PF_NO_SETAFFINITY tasks are ignored.
> + * All per cpu kthreads should have PF_NO_SETAFFINITY
> + * flag set, see kthread_set_per_cpu().
> */
> - if (kthread_is_per_cpu(task))
> + if (task->flags & PF_NO_SETAFFINITY)
> continue;
> cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
Acked-by: Frederic Weisbecker <frederic@kernel.org>
But this makes me realize I overlooked this case when I introduced the
centralized unbound kthreads affinity.

cpuset_update_tasks_cpumask() seems to blindly affine tasks to
subpartitions_cpus while unbound kthreads might have their own preferences
(per-node or random cpumasks).

So I need to make that pass through the kthread API.
It seems that subpartitions_cpus doesn't contain nohz_full= CPUs,
but it does exclude isolcpus=. And it's usually sane to assume that
nohz_full= CPUs are isolated.
I think I can just rename update_unbound_workqueue_cpumask()
to update_unbound_kthreads_cpumask() and then handle unbound
kthreads from there along with workqueues. And then completely
ignore kthreads from cpuset_update_tasks_cpumask().
Let me think about it (but feel free to apply the current patch meanwhile).
Thanks.
--
Frederic Weisbecker
SUSE Labs
Hello,

On Fri, May 09, 2025 at 03:18:17PM +0200, Frederic Weisbecker wrote:
...
> But this makes me realize I overlooked that when I introduced the unbound kthreads
> centralized affinity.
>
> cpuset_update_tasks_cpumask() seem to blindly affine to subpartitions_cpus
> while unbound kthreads might have their preferences (per-nodes or random cpumasks).
>
> So I need to make that pass through kthread API.

I wonder whether it'd be cleaner if all kthread affinity restrictions go
through housekeeping instead of cpuset modifying the cpumasks directly so
that housekeeping keeps track of where different classes of kthreads can run
and tell e.g. workqueue what to do.

Thanks.

--
tejun
On Fri, May 09, 2025 at 07:30:51AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, May 09, 2025 at 03:18:17PM +0200, Frederic Weisbecker wrote:
> ...
> > But this makes me realize I overlooked that when I introduced the unbound kthreads
> > centralized affinity.
> >
> > cpuset_update_tasks_cpumask() seem to blindly affine to subpartitions_cpus
> > while unbound kthreads might have their preferences (per-nodes or random cpumasks).
> >
> > So I need to make that pass through kthread API.
>
> I wonder whether it'd be cleaner if all kthread affinity restrictions go
> through housekeeping instead of cpuset modifying the cpumasks directly so
> that housekeeping keeps track of where different classes of kthreads can run
> and tell e.g. workqueue what to do.

Good suggestion. "isolated_cpus" should indeed be handled by housekeeping
itself. More precisely, housekeeping_cpu(HK_TYPE_DOMAIN) should be updated
through some housekeeping_update() function to union the boot 'isolcpus='
mask and the isolated mask of cpuset partitions. Waiman tried that at some
point. This will require some synchronization against the readers of
HK_TYPE_DOMAIN.

It's beyond the scope of the kthreads affinity issue, but yes, that's all
planned within the cpusets integration of nohz_full.

Thanks.

> Thanks.
>
> --
> tejun

--
Frederic Weisbecker
SUSE Labs
On 5/9/25 9:18 AM, Frederic Weisbecker wrote:
> On Thu, May 08, 2025 at 03:24:13PM -0400, Waiman Long wrote:
>> Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask()
>> on top_cpuset") enabled us to pull CPUs dedicated to child partitions
>> from tasks in top_cpuset by ignoring per cpu kthreads. However, there
>> can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
>> flag set to indicate that we shouldn't mess with their CPU affinity.
>> For other kthreads, their affinity will be changed to skip CPUs dedicated
>> to child partitions whether it is an isolating or a scheduling one.
>>
>> As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
>> PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
>> Fix this issue by dropping the kthread_is_per_cpu() check and checking
>> the PF_NO_SETAFFINITY flag instead.
>>
>> Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index d0143b3dce47..967603300ee3 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1130,9 +1130,11 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
>>
>> if (top_cs) {
>> /*
>> - * Percpu kthreads in top_cpuset are ignored
>> + * PF_NO_SETAFFINITY tasks are ignored.
>> + * All per cpu kthreads should have PF_NO_SETAFFINITY
>> + * flag set, see kthread_set_per_cpu().
>> */
>> - if (kthread_is_per_cpu(task))
>> + if (task->flags & PF_NO_SETAFFINITY)
>> continue;
>> cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
> Acked-by: Frederic Weisbecker <frederic@kernel.org>
>
> But this makes me realize I overlooked that when I introduced the unbound kthreads
> centralized affinity.
>
> cpuset_update_tasks_cpumask() seem to blindly affine to subpartitions_cpus
> while unbound kthreads might have their preferences (per-nodes or random cpumasks).
>
> So I need to make that pass through kthread API.
AFAIU, the kthread_bind_mask() or the kthread_bind() functions will
set PF_NO_SETAFFINITY.
>
> It seems that subpartition_cpus doesn't contain nohz_full= CPUs.
> But it excludes isolcpus=. And it's usually sane to assume that
> nohz_full= CPUs are isolated.
Most users that want isolated CPUs will set both isolcpus and nohz_full
to the same set of CPUs. I do see that RH OpenShift can set nohz_full
for a collection of CPUs that may be dynamically isolated later on via
cpuset partition.
>
> I think I can just rename update_unbound_workqueue_cpumask()
> to update_unbound_kthreads_cpumask() and then handle unbound
> kthreads from there along with workqueues. And then completely
> ignore kthreads from cpuset_update_tasks_cpumask().
I guess we can do that. Right now, update_unbound_workqueue_cpumask() is
only called to exclude isolated CPUs, while
cpuset_update_tasks_cpumask() updates the affinity for both isolated
and scheduling partitions. I agree that there is code duplication here.
To suit Xi Wang's use case, we may have to add a sysctl parameter, for
instance, to decide whether to update unbound kthreads in the
scheduling partition case.
Cheers,
Longman
> Let me think about it (but feel free to apply the current patch meanwhile).
>
> Thanks.
>