[v1] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

[PATCH] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Posted by K Prateek Nayak 4 years, 4 months ago

Neither sched/tip nor Mel's v5 patchset [1] provides an optimal
new-task wakeup strategy when tasks are affined to a subset of CPUs.
This can result in tasks piling up on the same set of CPUs in one NUMA
group despite there being other CPUs in a different NUMA group where
the tasks could have run. Good initial placement makes a difference,
especially for short-lived tasks, where the delay before the load
balancer kicks in can degrade performance.

The benchmark is a pinned run of STREAM, parallelized with OpenMP, on a
Zen3 machine. STREAM is configured to run 8 threads with CPU affinity
set to CPUs 0,16,32,48,64,80,96,112.
This ensures an even distribution of allowed CPUs across the NUMA groups
in NPS1, NPS2 and NPS4 modes.
The script running STREAM is itself pinned to CPUs 8-15 to maintain
consistency across runs and to make sure the script runs on an LLC that
is not part of the STREAM cpulist, so that it does not interfere with
the benchmark.

Changes are based on top of v5 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs" [1]

Following are the results:

	 5.17.0-rc1              5.17.0-rc1              5.17.0-rc1
	 tip sched/core          tip sched/core          tip sched/core
				 + mel-v5                + mel-v5
							 + this-fix

NPS Mode - NPS1

 Copy:	 92133.36 (0.00 pct)	 117595.57 (27.63 pct)	 150655.69 (63.51 pct)
Scale:	 90847.82 (0.00 pct)	 114525.59 (26.06 pct)	 148939.47 (63.94 pct)
  Add:	 103585.04 (0.00 pct)	 137548.40 (32.78 pct)	 186323.75 (79.87 pct)
Triad:	 100060.42 (0.00 pct)	 133695.34 (33.61 pct)	 184203.97 (84.09 pct)

NPS Mode - NPS2

 Copy:	 52969.12 (0.00 pct)	 76969.90 (45.31 pct)	 165892.91 (213.18 pct)
Scale:	 49209.91 (0.00 pct)	 69200.05 (40.62 pct)	 152210.69 (209.30 pct)
  Add:	 60106.69 (0.00 pct)	 92049.47 (53.14 pct)	 195135.02 (224.64 pct)
Triad:	 60052.66 (0.00 pct)	 88323.03 (47.07 pct)	 193672.59 (222.50 pct)

NPS Mode - NPS4

 Copy:	 44122.00 (0.00 pct)	 157154.70 (256.18 pct)	 169755.52 (284.74 pct)
Scale:	 41730.68 (0.00 pct)	 172303.88 (312.89 pct)	 170247.06 (307.96 pct)
  Add:	 51666.98 (0.00 pct)	 214293.71 (314.75 pct)	 213560.61 (313.34 pct)
Triad:	 50489.87 (0.00 pct)	 212242.49 (320.36 pct)	 210844.58 (317.59 pct)

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in mel-v5 in NPS2 mode:

stream-4578    [016] d..2.    81.970702: sched_wakeup_new: comm=stream pid=4580 prio=120 target_cpu=000
stream-4578    [016] d..2.    81.970760: sched_wakeup_new: comm=stream pid=4581 prio=120 target_cpu=016
stream-4578    [016] d..2.    81.970823: sched_wakeup_new: comm=stream pid=4582 prio=120 target_cpu=048
stream-4578    [016] d..2.    81.970875: sched_wakeup_new: comm=stream pid=4583 prio=120 target_cpu=032
stream-4578    [016] d..2.    81.970920: sched_wakeup_new: comm=stream pid=4584 prio=120 target_cpu=016
stream-4578    [016] d..2.    81.970961: sched_wakeup_new: comm=stream pid=4585 prio=120 target_cpu=016
stream-4578    [016] d..2.    81.971039: sched_wakeup_new: comm=stream pid=4586 prio=120 target_cpu=112

Three stream threads initially pile up on CPU 16. In short runs, where
the load balancer doesn't have enough time to kick in and migrate
tasks, performance might suffer. This pattern is observed consistently:
tasks pile up on one CPU of the group to which the runner script is
pinned.
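
The pile-up can be quantified by counting how many wakeups land on each
target_cpu. The following is a small userspace sketch, not part of the
patch. parse_target_cpu(), max_pileup() and MAX_CPUS are hypothetical
names, and the parser only assumes that each line carries a
target_cpu=<decimal> field as in the trace above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 128	/* hypothetical upper bound for this sketch */

/* Extract the target_cpu field from one trace line; -1 if absent or invalid. */
static int parse_target_cpu(const char *line)
{
	const char *key = "target_cpu=";
	const char *p = strstr(line, key);
	char *end;
	long cpu;

	if (!p)
		return -1;
	p += strlen(key);
	/* Base 10 is forced so that "016" is read as 16, not as octal. */
	cpu = strtol(p, &end, 10);
	if (end == p || cpu < 0 || cpu >= MAX_CPUS)
		return -1;
	return (int)cpu;
}

/* Count wakeups per CPU; return the largest count seen on any single CPU. */
static int max_pileup(const char *const *lines, size_t n, int *counts)
{
	int worst = 0;

	assert(lines != NULL && counts != NULL);
	memset(counts, 0, MAX_CPUS * sizeof(counts[0]));
	for (size_t i = 0; i < n; i++) {
		int cpu = parse_target_cpu(lines[i]);

		if (cpu < 0)
			continue;	/* skip lines without a usable field */
		if (++counts[cpu] > worst)
			worst = counts[cpu];
	}
	return worst;
}
```

Fed the seven mel-v5 lines above, max_pileup() reports 3, all on CPU 16.
For the trace taken with this fix it reports 1.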

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks with this fix in NPS2 mode:

stream-4639    [032] d..2.   102.903581: sched_wakeup_new: comm=stream pid=4641 prio=120 target_cpu=016
stream-4639    [032] d..2.   102.903698: sched_wakeup_new: comm=stream pid=4642 prio=120 target_cpu=048
stream-4639    [032] d..2.   102.903762: sched_wakeup_new: comm=stream pid=4643 prio=120 target_cpu=080
stream-4639    [032] d..2.   102.903823: sched_wakeup_new: comm=stream pid=4644 prio=120 target_cpu=112
stream-4639    [032] d..2.   102.903879: sched_wakeup_new: comm=stream pid=4645 prio=120 target_cpu=096
stream-4639    [032] d..2.   102.903938: sched_wakeup_new: comm=stream pid=4646 prio=120 target_cpu=000
stream-4639    [032] d..2.   102.903991: sched_wakeup_new: comm=stream pid=4647 prio=120 target_cpu=064

The tasks have been distributed evenly across all the allowed cpus
and no pile up can be seen.

Aggressive NUMA balancing is only done when needed. We select the
minimum of the number of allowed CPUs in the sched group and the
calculated sd.imb_numa_nr as the imbalance threshold, so the default
behavior of mel-v5 is only modified when the former is smaller than
the latter.

This can help in the case of embarrassingly parallel programs with a
tight CPU affinity set.
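
The threshold selection described above can be modelled in userspace C.
This is a minimal sketch, not the kernel code: CPU masks are plain
64-bit integers standing in for struct cpumask, count_cpus() and
keep_task_local() are hypothetical helper names, and the body of
allow_numa_imbalance() is assumed to be a simple
"running <= imb_numa_nr" comparison, since only its signature is
visible in the diff context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for cpumask_weight(): count the set bits in a 64-CPU mask. */
static int count_cpus(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

/* Assumed body of the mel-v5 helper; only its signature appears in the diff. */
static bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	return running <= imb_numa_nr;
}

/*
 * Hypothetical helper: true when the new task should stay in the local
 * group, which is the case where find_idlest_group() returns NULL.
 */
static bool keep_task_local(uint64_t group_span, uint64_t task_allowed,
			    int local_nr_running, int imb_numa_nr)
{
	/* Mirrors cpumask_and() followed by cpumask_weight() in the patch. */
	int allowed = count_cpus(group_span & task_allowed);
	/* Threshold is the smaller of the allowed CPUs and imb_numa_nr. */
	int imb = allowed < imb_numa_nr ? allowed : imb_numa_nr;

	assert(local_nr_running >= 0 && imb_numa_nr >= 0);
	return allow_numa_imbalance(local_nr_running + 1, imb);
}
```

Take a local group spanning CPUs 0-7 (mask 0x00ff), a task allowed on
CPUs 0 and 8 (mask 0x0101) and a hypothetical imb_numa_nr of 4.
keep_task_local() is true while the group is empty but false once one
task is already running there, so the second task is placed elsewhere.
The unmodified check, allow_numa_imbalance(2, 4), would still have kept
it local.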

[1] https://lore.kernel.org/lkml/20220203144652.12540-1-mgorman@techsingularity.net/

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/core.c | 3 +++
 kernel/sched/fair.c | 7 ++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d863d7f6ad7..9a92ac42bb24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9294,6 +9294,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, find_idlest_group_mask);
 
 void __init sched_init(void)
 {
@@ -9344,6 +9345,8 @@ void __init sched_init(void)
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+		per_cpu(find_idlest_group_mask, i) = (cpumask_var_t)kzalloc_node(
+			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index babf3b65db38..ffced741b244 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,6 +5751,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Working cpumask for: load_balance, load_balance_newidle. */
 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
 DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, find_idlest_group_mask);
 
 #ifdef CONFIG_NO_HZ_COMMON
 
@@ -9022,6 +9023,7 @@ static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(find_idlest_group_mask);
 	struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
 	struct sg_lb_stats local_sgs, tmp_sgs;
 	struct sg_lb_stats *sgs;
@@ -9130,6 +9132,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 
 	case group_has_spare:
 		if (sd->flags & SD_NUMA) {
+			int imb;
 #ifdef CONFIG_NUMA_BALANCING
 			int idlest_cpu;
 			/*
@@ -9150,7 +9153,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+			cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+			imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
 				return NULL;
 		}
 
-- 
2.25.1

Re: [PATCH] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Posted by Mel Gorman 4 years, 4 months ago

On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
> Neither the sched/tip nor Mel's v5 patchset [1] provides an optimal
> new-task wakeup strategy when the tasks are affined to a subset of cpus
> which can result in piling of tasks on the same set of CPU in a NUMA
> group despite there being other cpus in a different NUMA group where the
> task could have run in. A good placement makes a difference especially
> in case of short lived task where the delay in load balancer kicking in
> can cause degradation in performance.
> 

Thanks.

V6 was posted based on previous feedback. While this patch is building
on top of it, please add Acked-by or Tested-by if the imbalance series
helps the general problem of handling imbalances when there are multiple
last level caches.

> <SNIP>
>
> Aggressive NUMA balancing is only done when needed. We select the
> minimum of number of allowed cpus in sched group and the calculated
> sd.imb_numa_nr as our imbalance threshold and the default behavior
> of mel-v5 is only modified when the former is smaller than
> the latter.
> 

In this context, it should be safe to reuse select_idle_mask, as in this
build-tested patch:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538756bd8e7f..1e759c21371b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9128,6 +9128,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 
 	case group_has_spare:
 		if (sd->flags & SD_NUMA) {
+			struct cpumask *cpus;
+			int imb;
 #ifdef CONFIG_NUMA_BALANCING
 			int idlest_cpu;
 			/*
@@ -9145,10 +9147,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * Otherwise, keep the task close to the wakeup source
 			 * and improve locality if the number of running tasks
 			 * would remain below threshold where an imbalance is
-			 * allowed. If there is a real need of migration,
-			 * periodic load balance will take care of it.
+			 * allowed while accounting for the possibility the
+			 * task is pinned to a subset of CPUs.  If there is a
+			 * real need of migration, periodic load balance will
+			 * take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+			cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+			cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+			imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
 				return NULL;
 		}

Re: [PATCH] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Posted by K Prateek Nayak 4 years, 4 months ago

Hello Mel,

Thank you for taking a look at the patch.

On 2/8/2022 4:21 PM, Mel Gorman wrote:
> On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
>> Neither the sched/tip nor Mel's v5 patchset [1] provides an optimal
>> new-task wakeup strategy when the tasks are affined to a subset of cpus
>> which can result in piling of tasks on the same set of CPU in a NUMA
>> group despite there being other cpus in a different NUMA group where the
>> task could have run in. A good placement makes a difference especially
>> in case of short lived task where the delay in load balancer kicking in
>> can cause degradation in performance.
>>
> Thanks.
>
> V6 was posted based on previous feedback. While this patch is building
> on top of it, please add Acked-by or Tested-by if the imbalance series
> helps the general problem of handling imbalances when there are multiple
> last level caches.

Yes, the imbalance series does a good job handling the general imbalance
problem in case of systems with multiple LLCs. This patch builds on top of
your effort to balance more aggressively in certain scenarios arising when
tasks are pinned to a subset of CPUs.
I'll run benchmarks against v6 and ack the results on the imbalance patchset.

>> <SNIP>
>>
>> Aggressive NUMA balancing is only done when needed. We select the
>> minimum of number of allowed cpus in sched group and the calculated
>> sd.imb_numa_nr as our imbalance threshold and the default behavior
>> of mel-v5 is only modified when the former is smaller than
>> the latter.
>>
> In this context, it should be safe to reuse select_idle_mask like this
> build tested patch
Thank you for pointing this out. I'll make the changes in the follow up.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 538756bd8e7f..1e759c21371b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9128,6 +9128,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  
>  	case group_has_spare:
>  		if (sd->flags & SD_NUMA) {
> +			struct cpumask *cpus;
> +			int imb;
>  #ifdef CONFIG_NUMA_BALANCING
>  			int idlest_cpu;
>  			/*
> @@ -9145,10 +9147,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * Otherwise, keep the task close to the wakeup source
>  			 * and improve locality if the number of running tasks
>  			 * would remain below threshold where an imbalance is
> -			 * allowed. If there is a real need of migration,
> -			 * periodic load balance will take care of it.
> +			 * allowed while accounting for the possibility the
> +			 * task is pinned to a subset of CPUs.  If there is a
> +			 * real need of migration, periodic load balance will
> +			 * take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> +			cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +			cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> +			imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>  				return NULL;
>  		}
>  

Thank you for the feedback and suggestions.

Thanks and Regards
Prateek

Re: [PATCH] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Posted by Peter Zijlstra 4 years, 4 months ago

On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
> Neither the sched/tip nor Mel's v5 patchset [1] provides an optimal
> new-task wakeup strategy when the tasks are affined to a subset of cpus
> which can result in piling of tasks on the same set of CPU in a NUMA
> group despite there being other cpus in a different NUMA group where the
> task could have run in. 

Where does this affinity come from?

Re: [PATCH] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Posted by K Prateek Nayak 4 years, 4 months ago

Hello Peter,

On 2/9/2022 4:16 PM, Peter Zijlstra wrote:
> On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
>> Neither the sched/tip nor Mel's v5 patchset [1] provides an optimal
>> new-task wakeup strategy when the tasks are affined to a subset of cpus
>> which can result in piling of tasks on the same set of CPU in a NUMA
>> group despite there being other cpus in a different NUMA group where the
>> task could have run in. 
> Where does this affinity come from?

The affinity comes from restricting the process to a subset of the
available CPUs with taskset or numactl, which modifies the cpus_ptr
member of the process's task_struct.
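
For illustration, taskset applies such a mask through the
sched_setaffinity() system call. A minimal userspace sketch of pinning
the calling thread to the CPU list used in the benchmark could look
like the following. build_affinity_mask() and pin_self() are
hypothetical helper names, and the CPU numbers assume a machine that
actually has those CPUs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical helper: fill @set with the CPUs listed in @cpus. */
void build_affinity_mask(cpu_set_t *set, const int *cpus, size_t n)
{
	CPU_ZERO(set);
	for (size_t i = 0; i < n; i++) {
		/* CPU numbers must fit in a fixed-size cpu_set_t. */
		assert(cpus[i] >= 0 && cpus[i] < CPU_SETSIZE);
		CPU_SET(cpus[i], set);
	}
}

/*
 * Hypothetical helper: restrict the calling thread to @cpus.
 * Returns 0 on success, or -1 with errno set on failure.
 */
int pin_self(const int *cpus, size_t n)
{
	cpu_set_t set;

	build_affinity_mask(&set, cpus, n);
	/* A pid of 0 means "the calling thread". */
	return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_self() with the list 0,16,32,48,64,80,96,112 before the
worker threads are created gives them the same allowed set as the
STREAM run described in the patch, since new threads inherit the
affinity mask of their creator.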

---
Thanks and Regards
Prateek