[PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu

Chuyi Zhou posted 3 patches 1 year ago
Posted by Chuyi Zhou 1 year ago
Currently, task_numa_find_cpu() only skips CPUs that are not in the task's
cpumask. If that cpumask includes isolated CPUs, the task can be migrated
into an isolated domain. Partitions configured through cpuset are always
reflected in each member task's cpumask. With the isolcpus= kernel command
line option, however, the isolated CPUs are simply omitted from the
sched_domains without any further restriction on tasks' cpumasks.

This change restricts the set of CPUs the task may be migrated to from
p->cpus_ptr to the intersection of p->cpus_ptr and
housekeeping_cpumask(HK_TYPE_DOMAIN).

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0139659fe7a..05782b563609 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
 	return node_fully_busy;
 }
 
+static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
+{
+	return cpumask_test_cpu(cpu, p->cpus_ptr) &&
+			housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
+}
+
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
 static inline bool test_idle_cores(int cpu);
@@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
 		/* Find alternative idle CPU. */
 		for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
 			if (cpu == env->best_cpu || !idle_cpu(cpu) ||
-			    !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+			    !numa_migrate_test_cpu(env->p, cpu)) {
 				continue;
 			}
 
@@ -2480,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
-		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
+		if (!numa_migrate_test_cpu(env->p, cpu))
 			continue;
 
 		env->dst_cpu = cpu;
-- 
2.20.1
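
For readers following along, the combined check that numa_migrate_test_cpu()
performs can be modelled in user space. This is only a rough sketch under
stated assumptions: mask_t, mask_test_cpu(), model_numa_migrate_test_cpu()
and count_candidates() are hypothetical stand-ins that represent a cpumask
as a 64-bit word. They are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

static bool mask_test_cpu(int cpu, mask_t mask)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return (mask >> cpu) & 1;
}

/*
 * Models the check added by the patch: the CPU must be in the task's
 * allowed mask AND be a housekeeping CPU.
 */
static bool model_numa_migrate_test_cpu(mask_t task_allowed,
					mask_t housekeeping, int cpu)
{
	return mask_test_cpu(cpu, task_allowed) &&
	       mask_test_cpu(cpu, housekeeping);
}

/* Count the CPUs of a node that remain eligible as migration targets. */
static int count_candidates(mask_t node, mask_t task_allowed,
			    mask_t housekeeping)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!mask_test_cpu(cpu, node))
			continue;
		if (!model_numa_migrate_test_cpu(task_allowed, housekeeping, cpu))
			continue;
		n++;
	}
	return n;
}
```

With a destination node made of CPUs 4-7 (0xF0), a task allowed on CPUs 0-7
(0xFF) and CPUs 6-7 isolated (housekeeping mask 0x3F), only CPUs 4 and 5
remain candidates in this model.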
Re: [PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu
Posted by K Prateek Nayak 12 months ago
Hello Chuyi,

On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
> [..snip..]
> @@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
>   	return node_fully_busy;
>   }
>   
> +static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
> +{
> +	return cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +			housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
> +}
> +
>   #ifdef CONFIG_SCHED_SMT
>   /* Forward declarations of select_idle_sibling helpers */
>   static inline bool test_idle_cores(int cpu);
> @@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
>   		/* Find alternative idle CPU. */
>   		for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {

Can we just do:

	for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
		...
	}

and avoid adding numa_migrate_test_cpu(). Thoughts?

>   			if (cpu == env->best_cpu || !idle_cpu(cpu) ||
> -			    !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
> +			    !numa_migrate_test_cpu(env->p, cpu)) {
>   				continue;
>   			}
>   
> @@ -2480,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>   
>   	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {

Same modifications can be made for this outer loop.

-- 
Thanks and Regards,
Prateek

>   		/* Skip this CPU if the source task cannot migrate */
> -		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
> +		if (!numa_migrate_test_cpu(env->p, cpu))
>   			continue;
>   
>   		env->dst_cpu = cpu;
Re: [PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu
Posted by Chuyi Zhou 11 months, 4 weeks ago

On 2024/12/18 14:21, K Prateek Nayak wrote:
> Hello Chuyi,
> 
> On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
>> [..snip..]
>> @@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
>>       return node_fully_busy;
>>   }
>> +static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
>> +{
>> +    return cpumask_test_cpu(cpu, p->cpus_ptr) &&
>> +            housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
>> +}
>> +
>>   #ifdef CONFIG_SCHED_SMT
>>   /* Forward declarations of select_idle_sibling helpers */
>>   static inline bool test_idle_cores(int cpu);
>> @@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
>>           /* Find alternative idle CPU. */
>>           for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
> 
> Can we just do:
> 
>      for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
>          ...
>      }
> 
> and avoid adding numa_migrate_test_cpu(). Thoughts?

Makes sense, but there doesn't currently seem to be an API like
for_each_cpu_wrap_and().

Do you think the following is better?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 855df103f4dd..4792ef672738 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2167,9 +2167,9 @@ static void task_numa_assign(struct task_numa_env *env,
                 int start = env->dst_cpu;

                 /* Find alternative idle CPU. */
-               for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
+               for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
                         if (cpu == env->best_cpu || !idle_cpu(cpu) ||
-                           !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+                               cpu == start || !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
                                 continue;
                         }


Thanks.


> 
>>               if (cpu == env->best_cpu || !idle_cpu(cpu) ||
>> -                !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
>> +                !numa_migrate_test_cpu(env->p, cpu)) {
>>                   continue;
>>               }
>> @@ -2480,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>       for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
> 
> Same modifications can be made for this outer loop.
> 

Re: [PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu
Posted by K Prateek Nayak 11 months, 3 weeks ago
Hello Chuyi,

On 12/23/2024 6:28 PM, Chuyi Zhou wrote:
> 
> 
> On 2024/12/18 14:21, K Prateek Nayak wrote:
>> Hello Chuyi,
>>
>> On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
>>> [..snip..]
>>> @@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
>>>       return node_fully_busy;
>>>   }
>>> +static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
>>> +{
>>> +    return cpumask_test_cpu(cpu, p->cpus_ptr) &&
>>> +            housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
>>> +}
>>> +
>>>   #ifdef CONFIG_SCHED_SMT
>>>   /* Forward declarations of select_idle_sibling helpers */
>>>   static inline bool test_idle_cores(int cpu);
>>> @@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
>>>           /* Find alternative idle CPU. */
>>>           for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
>>
>> Can we just do:
>>
>>      for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
>>          ...
>>      }
>>
>> and avoid adding numa_migrate_test_cpu(). Thoughts?
> 
> Makes sense, but there doesn't currently seem to be an API like for_each_cpu_wrap_and().
> 
> Do you think the following is better?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 855df103f4dd..4792ef672738 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2167,9 +2167,9 @@ static void task_numa_assign(struct task_numa_env *env,
>                  int start = env->dst_cpu;
> 
>                  /* Find alternative idle CPU. */
> -               for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
> +               for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
>                          if (cpu == env->best_cpu || !idle_cpu(cpu) ||

"start" is set to "env->dst_cpu", which is already taken care of here by
the first comparison.

> -                           !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
> +                               cpu == start || !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
>                                  continue;
>                          }
> 

I think the for_each_cpu_wrap() was used to reduce contention for xchg
operation below. Perhaps we can have a per-cpu temporary mask (like
load_balance_mask) if we want to reduce the xchg contention and break
this into cpumask_and() + for_each_cpu_wrap() steps. I'm not sure if
any of the existing masks (load_balance_mask, select_rq_mask,
should_we_balance_tmpmask) can be safely reused. Otherwise, perhaps we
can make a case for for_each_cpu_and_wrap() with this use case.
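
The cpumask_and() + for_each_cpu_wrap() split described above could be
modelled roughly as follows. This is a user-space sketch under stated
assumptions, not kernel code: mask_t and model_and_wrap() are hypothetical
names, a cpumask is represented as a 64-bit word, and the temporary mask is
a plain local variable rather than a per-cpu mask.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

/*
 * Visit the CPUs set in both 'a' and 'b', starting the scan at 'start' and
 * wrapping around, and record them in visit order in 'out'.
 * Returns the number of CPUs recorded.
 */
static int model_and_wrap(mask_t a, mask_t b, int start, int *out)
{
	mask_t tmp = a & b;	/* step 1: models cpumask_and() into a temporary mask */
	int i, n = 0;

	assert(start >= 0);

	/* step 2: models for_each_cpu_wrap() over the temporary mask */
	for (i = 0; i < MODEL_NR_CPUS; i++) {
		int cpu = (start + i) % MODEL_NR_CPUS;

		if ((tmp >> cpu) & 1)
			out[n++] = cpu;
	}
	return n;
}
```

In this model, with node CPUs 4-7 (0xF0), housekeeping CPUs 0-6 (0x7F) and a
scan starting at CPU 6, the visit order is 6, 4, 5. Different starting
points therefore probe the same candidates in a different order, which is
the property the wrapped iteration is presumed to be used for here.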

> 
> Thanks.
> 
> 
>>
>>>               if (cpu == env->best_cpu || !idle_cpu(cpu) ||
>>> -                !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
>>> +                !numa_migrate_test_cpu(env->p, cpu)) {
>>>                   continue;
>>>               }
>>> @@ -2480,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>       for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
>>
>> Same modifications can be made for this outer loop.
>>
> 

-- 
Thanks and Regards,
Prateek

Re: [PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu
Posted by Chuyi Zhou 11 months, 3 weeks ago
Hello,

On 2024/12/27 12:40, K Prateek Nayak wrote:
> Hello Chuyi,
> 
> On 12/23/2024 6:28 PM, Chuyi Zhou wrote:
>>
>>
>> On 2024/12/18 14:21, K Prateek Nayak wrote:
>>> Hello Chuyi,
>>>
>>> On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
>>>> [..snip..]
>>>> @@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
>>>>       return node_fully_busy;
>>>>   }
>>>> +static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
>>>> +{
>>>> +    return cpumask_test_cpu(cpu, p->cpus_ptr) &&
>>>> +            housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
>>>> +}
>>>> +
>>>>   #ifdef CONFIG_SCHED_SMT
>>>>   /* Forward declarations of select_idle_sibling helpers */
>>>>   static inline bool test_idle_cores(int cpu);
>>>> @@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
>>>>           /* Find alternative idle CPU. */
>>>>           for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
>>>
>>> Can we just do:
>>>
>>>      for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
>>>          ...
>>>      }
>>>
>>> and avoid adding numa_migrate_test_cpu(). Thoughts?
>>
>> Makes sense, but there doesn't currently seem to be an API like for_each_cpu_wrap_and().
>>
>> Do you think the following is better?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 855df103f4dd..4792ef672738 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2167,9 +2167,9 @@ static void task_numa_assign(struct task_numa_env *env,
>>                  int start = env->dst_cpu;
>>
>>                  /* Find alternative idle CPU. */
>> -               for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
>> +               for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
>>                          if (cpu == env->best_cpu || !idle_cpu(cpu) ||
> 
> "start" is set to "env->dst_cpu", which is already taken care of here by
> the first comparison.
> 
>> -                           !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
>> +                               cpu == start || !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
>>                                  continue;
>>                          }
>>
> 
> I think the for_each_cpu_wrap() was used to reduce contention for xchg
> operation below. Perhaps we can have a per-cpu temporary mask (like
> load_balance_mask) if we want to reduce the xchg contention and break
> this into cpumask_and() + for_each_cpu_wrap() steps. I'm not sure if
> any of the existing masks (load_balance_mask, select_rq_mask,
> should_we_balance_tmpmask) can be safely reused. Otherwise, perhaps we
> can make a case for for_each_cpu_and_wrap() with this use case.
> 


for_each_cpu_and_wrap() is a good idea, but it might be slightly 
off-topic for this subject. Perhaps we should stick with this 
implementation for now and see what others think about v2.


Thanks.

Re: [PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu
Posted by K Prateek Nayak 11 months, 2 weeks ago
Hello Chuyi,

On 12/27/2024 1:29 PM, Chuyi Zhou wrote:
> [..snip..]
>>
>> I think the for_each_cpu_wrap() was used to reduce contention for xchg
>> operation below. Perhaps we can have a per-cpu temporary mask (like
>> load_balance_mask) if we want to reduce the xchg contention and break
>> this into cpumask_and() + for_each_cpu_wrap() steps. I'm not sure if
>> any of the existing masks (load_balance_mask, select_rq_mask,
>> should_we_balance_tmpmask) can be safely reused. Otherwise, perhaps we
>> can make a case for for_each_cpu_and_wrap() with this use case.
>>
> 
> 
> for_each_cpu_and_wrap() is a good idea, but it might be slightly off-topic for this subject. Perhaps we should stick with this implementation for now and see what others think about v2.

Sure thing! No strong feeling from my side :)

> 
> 
> Thanks.
> 

-- 
Thanks and Regards,
Prateek