[PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()

Posted by Chen Yu 2 years ago
Problem statement:
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.

Proposal:
Introduce SIS_CACHE, which considers the sleep time of the task for
better task placement. Based on the task's short sleeping history,
tag p's previous CPU as cache-hot. Later, when p is woken up, it can
choose its previous CPU in select_idle_sibling(). When another task
is woken up, skip this cache-hot idle CPU when possible.

SIS_CACHE still prefers to choose an idle CPU during task wakeup;
the idea is to optimize the idle CPU scan sequence.

As pointed out by Prateek, this has the risk that all idle CPUs are
cache-hot and get skipped. Mitigate this by returning the first
cache-hot idle CPU as a fallback. Meanwhile, to reduce the time spent
on scanning, limit the maximum cache-hot CPU search depth to half of
the number suggested by SIS_UTIL.
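
For reference, the check used by this patch can be sketched as below.
This is a simplified illustration, not the exact series code: the
sleep-time tracking and the per-rq "cache-hot until" timestamp come
from patches 1/3 and 2/3, and the field name cache_hot_timeout is an
assumption.

/*
 * Sketch: a CPU is considered cache-hot if the current time has not
 * yet passed the deadline stamped on its runqueue when the previous
 * short-sleeping task was dequeued (rq->cache_hot_timeout is an
 * illustrative name, not necessarily the series' exact field).
 */
static bool cache_hot_cpu(int cpu, int *hot_cpu)
{
	if (!sched_feat(SIS_CACHE))
		return false;

	/* the tag has expired; treat the CPU as a normal idle CPU */
	if (sched_clock_cpu(cpu) >= cpu_rq(cpu)->cache_hot_timeout)
		return false;

	/* record the first cache-hot idle CPU as the scan's fallback */
	if (*hot_cpu == -1)
		*hot_cpu = cpu;

	return true;
}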

Tested on a 2 x 60C/120T Xeon platform:

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	60-threads	 1.00 (  1.37)	 +0.04 (  1.47)
TCP_RR          	120-threads	 1.00 (  1.77)	 -1.03 (  1.31)
TCP_RR          	180-threads	 1.00 (  2.03)	 +1.25 (  1.66)
TCP_RR          	240-threads	 1.00 ( 41.31)	+73.71 ( 22.02)
TCP_RR          	300-threads	 1.00 ( 12.79)	 -0.11 ( 15.84)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	60-threads	 1.00 (  0.35)	 +0.40 (  0.31)
loopback        	120-threads	 1.00 (  1.94)	 -1.89 (  1.17)
loopback        	180-threads	 1.00 ( 13.59)	 +9.97 (  0.93)
loopback        	240-threads	 1.00 ( 11.68)	+42.85 (  7.28)
loopback        	300-threads	 1.00 (  4.47)	+15.12 (  1.40)

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  9.21)	 -7.88 (  2.03)
process-pipe    	2-groups	 1.00 (  7.09)	 +5.47 (  9.02)
process-pipe    	4-groups	 1.00 (  1.60)	 +1.53 (  1.70)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.98)	 +0.26 (  0.37)
normal          	2-mthreads	 1.00 (  3.99)	 -7.97 (  7.33)
normal          	4-mthreads	 1.00 (  3.07)	 -1.55 (  3.27)

Also did some experiments with an OLTP workload on a 112-core, 2-socket
Sapphire Rapids (SPR) machine. The OLTP workload has a mixture of
threads handling database updates on disk and handling transaction
queries over the network. Around 0.7% improvement is observed, with
less than 0.2% run-to-run variation.

Thanks to Madadi for testing SIS_CACHE on a Power system with 96 CPUs.
The results showed a maximum of 29% improvement in hackbench, 13%
improvement in the producer_consumer workload, and 2% improvement in a
real-life workload named Daytrader.

Thanks to Prateek for running microbenchmarks on top of the latest patch
on a 3rd Generation EPYC system:
- 2 sockets, each with 64C/128T
- NPS1 (each socket is a NUMA node)
- C2 disabled (POLL and C1 (MWAIT) remained enabled)
No consistent regression was observed with v2.

Analysis:
The benefit of waking a task up on its previous CPU comes from fewer
task migrations and better local resource locality.

Take the netperf 240-threads case as an example: run the following
bpftrace script to track the number of wakeup migrations within 10
seconds, and use perf topdown to track the PMU events. Task migrations
and stall cycles are reduced significantly with SIS_CACHE:

kretfunc:select_task_rq_fair
{
        $p = (struct task_struct *)args->p;
        if ($p->comm == "netperf") {
                /* retval is the CPU chosen for the wakee */
                if ($p->thread_info.cpu != retval) {
                        /* chosen CPU differs from the previous one */
                        @wakeup_migrate_netperf = count();
                } else {
                        @wakeup_prev_netperf = count();
                }
        }
}
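
(One illustrative way to run the script above, assuming it is saved as
wakeup.bt, is "timeout 10 bpftrace wakeup.bt"; bpftrace prints the two
counter maps when it exits. The invocation is not from the original
posting.)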

NO_SIS_CACHE:
@wakeup_migrate_netperf: 57473509
@wakeup_prev_netperf: 14935964
RESOURCE_STALLS: 19.1% * 7.1% * 35.0%

SIS_CACHE:
@wakeup_migrate_netperf: 799
@wakeup_prev_netperf: 132937436
RESOURCE_STALLS: 5.4% * 7.5% * 39.8%

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c309b3d203c0..d149eca74fca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7360,7 +7360,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
  * Return true if the idle CPU is cache-hot for someone,
  * return false otherwise.
  */
-static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
+static bool cache_hot_cpu(int cpu, int *hot_cpu)
 {
 	if (!sched_feat(SIS_CACHE))
 		return false;
@@ -7383,7 +7383,7 @@ static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, nr_hot = 0, hot_cpu = -1;
 	struct sched_domain_shared *sd_share;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -7396,6 +7396,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 			/* overloaded LLC is unlikely to have idle cpu/core */
 			if (nr == 1)
 				return -1;
+
+			/* max number of cache-hot idle cpu check */
+			nr_hot = nr >> 1;
 		}
 	}
 
@@ -7426,18 +7429,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			if ((unsigned int)i < nr_cpumask_bits) {
+				if (--nr_hot >= 0 && cache_hot_cpu(i, &hot_cpu))
+					continue;
+
 				return i;
+			}
 
 		} else {
 			if (--nr <= 0)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
-			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+			if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+				if (--nr_hot >= 0 && cache_hot_cpu(idle_cpu, &hot_cpu))
+					continue;
+
 				break;
+			}
 		}
 	}
 
+	/* pick the first cache-hot CPU as the last resort */
+	if (idle_cpu == -1 && hot_cpu != -1)
+		idle_cpu = hot_cpu;
+
 	if (has_idle_core)
 		set_idle_cores(target, false);
 
-- 
2.25.1
Re: [PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()
Posted by Hillf Danton 1 year, 10 months ago
On Tue, 21 Nov 2023 15:40:14 +0800 Chen Yu <yu.c.chen@intel.com>
> Problem statement:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually preferred
> because it can improve cache locality. However, in many cases the
> previous CPU has already been taken by other wakees, so p has to
> find another idle CPU.
> 
> Proposal:
> Introduce SIS_CACHE, which considers the sleep time of the task for
> better task placement. Based on the task's short sleeping history,
> tag p's previous CPU as cache-hot. Later, when p is woken up, it can
> choose its previous CPU in select_idle_sibling(). When another task
> is woken up, skip this cache-hot idle CPU when possible.
> 
> SIS_CACHE still prefers to choose an idle CPU during task wakeup;
> the idea is to optimize the idle CPU scan sequence.

Could you specify why the currently selected cpu fails to work in the
scenario described above?

	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev))
		return prev;
Re: [PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()
Posted by Chen Yu 1 year, 10 months ago
Hi Hillf,

On 2024-02-19 at 19:50:14 +0800, Hillf Danton wrote:
> On Tue, 21 Nov 2023 15:40:14 +0800 Chen Yu <yu.c.chen@intel.com>
> > Problem statement:
> > [snip...]
> 
> Could you specify why the currently selected cpu fails to work in the
> scenario described above?
>

Thank you for your review.

I assume that "currently selected cpu" means "target". First, the "target"
could be chosen by wake_affine() -> wake_affine_weight(), which can
return a non-idle candidate CPU. That is to say, when we reach the code
below, both target and prev could be non-idle. Second, when target and
prev are both non-idle, select_idle_sibling() has to traverse the other
CPUs to find an idle one. What SIS_CACHE does is increase the likelihood
that prev is idle, so the code path below can return it directly.

Say task p1's previous CPU is CPU1, and p1 is sleeping. With SIS_CACHE,
when another task p2 is woken up in select_idle_sibling()->select_idle_cpu(),
p2 tries to find CPU2 or CPU3 or CPU4... but not CPU1. This makes it more
likely that p1 finds CPU1 idle when it is woken up, and chooses CPU1 when
possible.

thanks,
Chenyu
> 	/*
> 	 * If the previous CPU is cache affine and idle, don't be stupid:
> 	 */
> 	if (prev != target && cpus_share_cache(prev, target) &&
> 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> 	    asym_fits_cpu(task_util, util_min, util_max, prev))
> 		return prev;
Re: [PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()
Posted by Madadi Vineeth Reddy 2 years ago
Hi Chen Yu,

On 21/11/23 13:10, Chen Yu wrote:
> Problem statement:
> [snip...]
> 
> Proposal:
> Introduce SIS_CACHE, which considers the sleep time of the task for
> better task placement. Based on the task's short sleeping history,
> tag p's previous CPU as cache-hot. Later, when p is woken up, it can
> choose its previous CPU in select_idle_sibling(). When another task
> is woken up, skip this cache-hot idle CPU when possible.
> 
> [snip...]
> 
> +	/* pick the first cache-hot CPU as the last resort */
> +	if (idle_cpu == -1 && hot_cpu != -1)
> +		idle_cpu = hot_cpu;
> 
> [snip...]

As per my understanding, if the task which tagged a particular CPU as cache-hot has woken up before
the cache_hot_timeout, we still don't allow that task to run on that particular CPU. Is this correct?

If correct, then why don't we let the task select the CPU if it's the one that tagged it?

Thanks and Regards
Madadi Vineeth Reddy
Re: [PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years ago
Hi Madadi,

On Thu, Nov 30, 2023 at 1:26 AM Madadi Vineeth Reddy
<vineethr@linux.ibm.com> wrote:
>
> Hi Chen Yu,
>
[snip...]

>
> As per my understanding, if the task which tagged a particular CPU as cache-hot has woken up before
> the cache_hot_timeout, we still don't allow that task to run on that particular CPU. Is this correct?
>

Not exactly. By the time we reach select_idle_cpu(), select_idle_sibling()
has already checked whether the wakee's previous CPU is idle. If it is
idle, we don't check the cache-hot tag and return the wakee's previous
CPU directly from select_idle_sibling().
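
For reference, the relevant ordering in select_idle_sibling()
(simplified from the mainline code; not the patched version) is:

	/*
	 * prev is checked before any scan; the cache-hot tag is not
	 * consulted on this fast path:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev))
		return prev;

	/* ... */

	/* only the full scan consults cache_hot_cpu() */
	i = select_idle_cpu(p, sd, has_idle_core, target);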

thanks,
Chenyu

> If correct, then why don't we let the task select the CPU if it's the one that tagged it?
>
> Thanks and Regards
> Madadi Vineeth Reddy
Re: [PATCH v2 3/3] sched/fair: do not scribble cache-hot CPU in select_idle_cpu()
Posted by Madadi Vineeth Reddy 2 years ago
On 30/11/23 12:13, Chen Yu wrote:
> Hi Madadi,
> 
> On Thu, Nov 30, 2023 at 1:26 AM Madadi Vineeth Reddy
> <vineethr@linux.ibm.com> wrote:
>>
>> Hi Chen Yu,
>>
> [snip...]
> 
>>
>> As per my understanding, if the task which tagged a particular CPU as cache-hot has woken up before
>> the cache_hot_timeout, we still don't allow that task to run on that particular CPU. Is this correct?
>>
> 
> Not exactly. By the time we reach select_idle_cpu(), select_idle_sibling()
> has already checked whether the wakee's previous CPU is idle. If it is
> idle, we don't check the cache-hot tag and return the wakee's previous
> CPU directly from select_idle_sibling().

Yeah, got it. Thanks.

So, one way another task (t') can select the cache-hot tagged CPU (tagged for task t) is when t'
gets that CPU as its target from wake_affine() in select_task_rq_fair(). The other way is the
"/* pick the first cache-hot CPU as the last resort */" fallback.

> 
> thanks,
> Chenyu
> 
>> If correct, then why don't we let the task select the CPU if it's the one that tagged it?
>>
>> Thanks and Regards
>> Madadi Vineeth Reddy

Thanks and Regards
Madadi Vineeth Reddy