[PATCH 2/3] sched/fair: Ignore isolated cpus in update_numa_stat

Chuyi Zhou posted 3 patches 1 year ago
[PATCH 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
Posted by Chuyi Zhou 1 year ago
Now update_numa_stats() iterates each cpu in a node to gather load
information for the node and attempts to find the idle cpu as a candidate
best_cpu within the node.

In update_numa_stats() we should take into account the scheduling domain.
This is because the "isolcpus" kernel command line option and cpuset iso-
late partitions can remove CPUs from load balance. Similar to task wakeup
and periodic load balancing, we should not involve isolated CPUs in NUMA
balancing. When gathering load information for nodes, we need to ignore the
load of isolated CPUs. This change also avoids selecting an isolated CPU
as the idle_cpu.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f544012b9320..a0139659fe7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2125,6 +2125,11 @@ static void update_numa_stats(struct task_numa_env *env,
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
+		/* skip isolated cpus' load */
+		if (!rcu_dereference(rq->sd))
+			continue;
+
+		ns->weight++;
 		ns->load += cpu_load(rq);
 		ns->runnable += cpu_runnable(rq);
 		ns->util += cpu_util_cfs(cpu);
@@ -2144,8 +2149,6 @@ static void update_numa_stats(struct task_numa_env *env,
 	}
 	rcu_read_unlock();
 
-	ns->weight = cpumask_weight(cpumask_of_node(nid));
-
 	ns->node_type = numa_classify(env->imbalance_pct, ns);
 
 	if (idle_core >= 0)
-- 
2.20.1
Re: [PATCH 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
Posted by K Prateek Nayak 12 months ago
Hello Chuyi,

On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
> [..snip..] 
> @@ -2125,6 +2125,11 @@ static void update_numa_stats(struct task_numa_env *env,
>   	for_each_cpu(cpu, cpumask_of_node(nid)) {

Looking at sched_init_domains(), we only build sched domains only for
active CPUs in housekeeping_cpumask(HK_TYPE_DOMAIN) so similar to the
question on Patch 3, can we get away with just modifying this outer loop
to:

	for_each_cpu_and(cpu, cpumask_of_node(nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
		...
	}

Thoughts?

-- 
Thanks and Regards,
Prateek

>   		struct rq *rq = cpu_rq(cpu);
>   
> +		/* skip isolated cpus' load */
> +		if (!rcu_dereference(rq->sd))
> +			continue;
> +
> +		ns->weight++;
>   		ns->load += cpu_load(rq);
>   		ns->runnable += cpu_runnable(rq);
>   		ns->util += cpu_util_cfs(cpu);
> @@ -2144,8 +2149,6 @@ static void update_numa_stats(struct task_numa_env *env,
>   	}
>   	rcu_read_unlock();
>   
> -	ns->weight = cpumask_weight(cpumask_of_node(nid));
> -
>   	ns->node_type = numa_classify(env->imbalance_pct, ns);
>   
>   	if (idle_core >= 0)
Re: [External] Re: [PATCH 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
Posted by Chuyi Zhou 12 months ago
Hello Prateek

在 2024/12/18 14:26, K Prateek Nayak 写道:
> Hello Chuyi,
> 
> On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
>> [..snip..] @@ -2125,6 +2125,11 @@ static void update_numa_stats(struct 
>> task_numa_env *env,
>>       for_each_cpu(cpu, cpumask_of_node(nid)) {
> 
> Looking at sched_init_domains(), we only build sched domains only for
> active CPUs in housekeeping_cpumask(HK_TYPE_DOMAIN) so similar to the
> question on Patch 3, can we get away with just modifying this outer loop
> to:
> 
>      for_each_cpu_and(cpu, cpumask_of_node(nid), 
> housekeeping_cpumask(HK_TYPE_DOMAIN)) {
>          ...
>      }
> 
> Thoughts?
> 

We now have two ways of using isolated CPUs.

One is the isolcpus= kernel command line. 'isolcpus=0-7' would
exclude 0-7 cpus from housekeeping_cpumask without further restrictions 
on tasks' cpumasks. A typical case that could lead to errors is when a 
task's CPU mask covers all the CPUs in the system, causing the task to 
potentially be migrated to or woken up on CPUs 0-7. Commit 23d04d8 
resolves a similar issue in task wakeup.


The other is the isolated cpuset partition:
  mkdir blue
  echo 5-8 > blue/cpuset.cpus
  echo "isolated" > blue/cpuset.cpus.partition

The 5-8 cpus now is also a isolated domain, but 5-8 would not be 
excluded from housekeeping_cpumask. The tasks' cpumask in blue cgroup 
would be restricted to 5-8.

I thinks in update_numa_stats(), only using 
housekeeping_cpumask(HK_TYPE_DOMAIN) is not enough because it cannot 
skip those isolated cpuset partitions.

Thanks.